Abstract — We consider the problem of neural association for 
a network of non-binary neurons. Here, the task is to first 
memorize a set of patterns using a network of neurons whose 
states assume values from a finite number of integer levels. 
Later, the same network should be able to recall previously 
memorized patterns from their noisy versions. Prior works in this 
area consider storing a finite number of purely random patterns, 
and have shown that the pattern retrieval capacity (the maximum 
number of patterns that can be memorized) scales only linearly 
with the number of neurons in the network. 

In our formulation of the problem, we concentrate on exploit- 
ing redundancy and internal structure of the patterns in order to 
improve the pattern retrieval capacity. Our first result shows that 
if the given patterns have a suitable linear-algebraic structure, 
i.e. comprise a sub-space of the set of all possible patterns, then 
the pattern retrieval capacity is in fact exponential in terms of 
the number of neurons. The second result extends the previous 
finding to cases where the patterns have weak minor components, 
i.e. the smallest eigenvalues of the correlation matrix tend toward 
zero. We will use these minor components (or the basis vectors 
of the pattern null space) to both increase the pattern retrieval 
capacity and error correction capabilities. 

An iterative algorithm is proposed for the learning phase, and 
two simple neural update algorithms are presented for the recall 
phase. Using analytical results and simulations, we show that 
the proposed methods can tolerate a fair amount of errors in 
the input while being able to memorize an exponentially large 
number of patterns. 

Index Terms — Neural associative memory, error correcting 
codes, message passing, stochastic learning, dual-space method. 

I. Introduction 

Neural associative memory is a particular class of neural 
networks capable of memorizing (learning) a set of patterns 
and recalling them later in the presence of noise, i.e. retrieving 
the correct memorized pattern from a given noisy version. 
Starting from the seminal work of Hopfield in 1982 [1], 
various artificial neural networks have been designed to mimic 
the task of the neuronal associative memory (see for instance 

0, 0, 0, ng, 0). 

In essence, the neural associative memory problem is very 
similar to the one faced in communication systems where the 
goal is to reliably and efficiently retrieve a set of patterns (so 



called codewords) from noisy versions. More interestingly, the 
techniques used to implement an artificial neural associative 
memory look very similar to some of the methods used in 
graph-based modern codes to decode information. This makes 
the pattern retrieval phase in neural associative memories very 
similar to iterative decoding techniques in modern coding 
theory. 

However, despite the similarity in the task and techniques 
employed in both problems, there is a huge gap in terms 
of efficiency. Using binary codewords of length $n$, one can 
construct codes that are capable of reliably transmitting $2^{rn}$ 
codewords over a noisy channel, where $0 < r < 1$ is the code 
rate [7]. The optimal $r$ (i.e. the largest possible value that 
permits the almost sure recovery of transmitted codewords 
from the corrupted received versions) depends on the noise 
characteristics of the channel and is known as the Shannon 
capacity [8]. In fact, the Shannon capacity is achievable 
in certain cases, for example by LDPC codes over AWGN 
channels. 

In current neural associative memories, however, with a 
network of size $n$ one can only memorize $O(n)$ binary patterns 
of length n 0, 0. To be fair, it must be mentioned that these 
networks are designed such that they are able to memorize any 
possible set of randomly chosen patterns (with size $O(n)$ of 
course) (e.g., 0, 0, 0, 0). Therefore, although humans 
cannot memorize random patterns, these methods provide 
artificial neural associative memories with a pleasant sense 
of generality. 

However, this generality severely restricts the efficiency of 
the network since even if the input patterns have some internal 
redundancy or structure, current neural associative memories 
could not exploit this redundancy in order to increase the 
number of memorizable patterns or improve error correction 
during the recall phase. In fact, concentrating on redundancies 
within patterns is a fairly new viewpoint. This point of 
view is in harmony with coding techniques, where one designs 
codewords with a certain degree of redundancy and then uses this 
redundancy to correct corrupted signals at the receiver's side. 



In this paper, we focus on bridging the performance gap between 
coding techniques and neural associative memories. 
Our proposed neural network exploits the inherent structure 
of the input patterns in order to increase the pattern retrieval 
capacity from $O(n)$ to $O(a^n)$ with $a > 1$. More specifically, 
the proposed neural network is capable of learning and reliably 
recalling given patterns when they come from a subspace with 
dimension $k < n$ of all possible $n$-dimensional patterns. Note 
that although the proposed model does not have the versatility 
of traditional associative memories, such as the Hopfield 
network [1], to handle any set of inputs, it enables us to boost 
the capacity to a great extent in cases where there is some input 
redundancy. In contrast, traditional associative memories will 
still have linear pattern retrieval capacity even if the patterns 
have good linear-algebraic structure. 

In [10], we presented some preliminary results in which two 
efficient recall algorithms were proposed for the case where 
the neural graph had the structure of an expander [11]. Here, 
we extend the previous results to general sparse neural graphs 
and propose a simple learning algorithm to capture 
the internal structure of the patterns (which will be used later 
in the recall phase). 

The remainder of this paper is organized as follows: In 
Section II we discuss the neural model used in this 
paper and formally define the associative memory problem. 
We explain the proposed learning algorithm in Section III. 
Sections IV and V are respectively dedicated to the recall 
algorithm and to analytically investigating its performance in 
retrieving corrupted patterns. In Section VI we address the 
pattern retrieval capacity and show that it is exponential in 
$n$. Simulation results are discussed in Section VII. Section 
VIII concludes the paper and discusses future research topics. 
Finally, the Appendices contain some extra remarks as well as 
the proofs for certain lemmas and theorems. 

II. Problem Formulation and the Neural Model 
A. The Model 

In the proposed model, we work with neurons whose 
states are integers from a finite set of non-negative values 
$\mathcal{Q} = \{0, 1, \dots, Q - 1\}$. A natural way of interpreting this 
model is to think of the integer states as the short-term firing 
rate of neurons (possibly quantized). In other words, the state 
of a neuron in this model indicates the number of spikes fired 
by the neuron in a fixed short time interval. 

Like in other neural networks, neurons can only perform 
simple operations. We consider neurons that can do linear 
summation over the input and possibly apply a non-linear 
function (such as thresholding) to produce the output. More 
specifically, neuron $x$ updates its state based on the states of 
its neighbors $\{s_i\}_{i=1}^{n}$ as follows: 

1) It computes the weighted sum $h = \sum_{i=1}^{n} w_i s_i$, where 
$w_i$ denotes the weight of the input link from the $i$-th 
neighbor. 

2) It updates its state as $x = f(h)$, where $f : \mathbb{R} \to \mathcal{Q}$ 
is a possibly non-linear function from the field of real 
numbers $\mathbb{R}$ to $\mathcal{Q}$. 




Fig. 1. A bipartite graph that represents the constraints on the training set. 



We will refer to these two as "neural operations" in the sequel. 
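
As a minimal illustration of the two neural operations, the following Python sketch computes the weighted sum and then applies a non-linear map. The function names and the particular choice of $f$ (rounding followed by clipping to $\{0, \dots, Q-1\}$) are assumptions made for this example only; the model itself only requires $f$ to map real numbers to admissible states.

```python
def neuron_update(weights, neighbor_states, f):
    """Apply the two neural operations: weighted sum, then the map f."""
    # Operation 1: weighted sum h over the incoming links.
    h = sum(w * s for w, s in zip(weights, neighbor_states))
    # Operation 2: map h back to an admissible integer state.
    return f(h)


def make_clipping_f(Q):
    """Hypothetical choice of f: round, then clip to {0, ..., Q - 1}."""
    def f(h):
        return min(max(int(round(h)), 0), Q - 1)
    return f


f = make_clipping_f(Q=4)
# Here h = 1*3 - 1*1 + 2*2 = 6, which is clipped to Q - 1 = 3.
state = neuron_update([1, -1, 2], [3, 1, 2], f)
```

Any other map from the reals to the state set, such as a threshold function, could be passed in place of the clipping function.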

B. The Problem 

The neural associative memory problem consists of two 
parts: learning and pattern retrieval. 

1) The learning phase: We assume that we are given $C$ vectors of 
length $n$ with integer-valued entries belonging to $\mathcal{Q}$. Furthermore, 
we assume these patterns belong to a subspace of $\mathcal{Q}^n$ 
with dimension $k < n$. Let $\mathcal{X}_{C \times n}$ be the matrix that contains 
the set of patterns in its rows. Note that if $k = n$, then we are 
back to the original associative memory problem. However, 
our focus will be on the case where $k < n$, which will be 
shown to yield much larger pattern retrieval capacities. Let us 
denote the model specification by a triplet $(\mathcal{Q}, n, k)$. 

The learning phase then comprises a set of steps to de- 
termine the connectivity of the neural graph (i.e. finding a 
set of weights) as a function of the training patterns in X 
such that these patterns are stable states of the recall process. 
More specifically, in the learning phase we would like to 
memorize the patterns in $\mathcal{X}$ by finding a set of non-zero 
vectors $w_1, \dots, w_m \in \mathbb{R}^n$ that are orthogonal to the set 
of given patterns. Note that such vectors exist (for 
instance, the basis of the null space). 

Our interest is to come up with a neural scheme to determine 
these vectors. Therefore, the inherent structure of the patterns 
is captured in the obtained null-space vectors, denoted by 
the matrix $W \in \mathbb{R}^{m \times n}$, whose $i$-th row is $w_i$. This matrix 
can be interpreted as the adjacency matrix of a bipartite graph 
which represents our neural network. The graph is composed 
of pattern and constraint neurons (nodes). Pattern neurons, as 
their name suggests, correspond to the states of the patterns we 
would like to learn or recall. The constraint neurons, on the 
other hand, should verify whether the current pattern belongs to the 
database $\mathcal{X}$. If not, they should send proper feedback messages 
to the pattern neurons in order to help them converge to the 
correct pattern in the dataset. The overall network model is 
shown in Figure 1. 
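
To make the setup concrete, the sketch below builds a toy instance with $n = 4$ and $k = 2$ and checks that every pattern is orthogonal to the rows of a constraint matrix. The basis, the matrix `W` and all function names are hypothetical values chosen for illustration; they are not taken from the experiments in this paper.

```python
from itertools import product

# Hypothetical basis of a k = 2 dimensional subspace of n = 4 dimensional patterns.
BASIS = [[1, 0, 1, 1],
         [0, 1, 1, 0]]

# Hypothetical sparse constraint matrix W with m = n - k = 2 rows.
W = [[1, 0, 0, -1],
     [1, 1, -1, 0]]


def make_patterns(basis, coeff_levels):
    """All integer combinations of the basis with coefficients in range(coeff_levels)."""
    n = len(basis[0])
    patterns = []
    for coeffs in product(range(coeff_levels), repeat=len(basis)):
        patterns.append([sum(c * b[j] for c, b in zip(coeffs, basis))
                         for j in range(n)])
    return patterns


def constraint_sums(W, x):
    """Input sums seen by the constraint neurons, i.e. the vector W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


patterns = make_patterns(BASIS, coeff_levels=3)
# Every memorized pattern satisfies all constraints (W x = 0).
all_satisfied = all(constraint_sums(W, x) == [0, 0] for x in patterns)
```

In this toy instance the number of patterns is `coeff_levels ** k`, so it grows exponentially with the subspace dimension $k$, while a vector outside the subspace, such as `[2, 0, 1, 1]`, violates the constraints.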

2) The recall phase: In the recall phase, the neural network 
should retrieve the correct memorized pattern from a possibly 
corrupted version. In this case, the states of the pattern neurons 
$x_1, x_2, \dots, x_n$ are initialized with the given (noisy) input 
pattern. Here, we assume that the noise is integer-valued 
and additive.¹ Therefore, assuming the input to the network 
is a corrupted version of pattern $x^\mu$, the state of the pattern 
nodes is $x = x^\mu + z$, where $z$ is the noise. Now the neural 
network should use the given states together with the fact that 
$W x^\mu = 0$ to retrieve pattern $x^\mu$, i.e. it should estimate $z$ from 
$Wx = Wz$ and return $x^\mu = x - z$. Any algorithm designed for 
this purpose should be simple enough to be implemented by 
neurons. Therefore, our objective is to find a simple algorithm 
capable of eliminating noise using only neural operations. 
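
The reason the constraint neurons see only the noise follows in one line from the orthogonality of $W$ to the memorized patterns. Written out, with $x^\mu$ the memorized pattern and $z$ the additive noise:

```latex
W x = W (x^\mu + z) = W x^\mu + W z = 0 + W z = W z .
```

In particular, $Wx = 0$ whenever the input is noise-free, so a non-zero input sum at any constraint neuron signals the presence of noise.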

C. Related Works 

Designing a neural associative memory has been an active 
area of research for the past three decades. Hopfield was 
the first to design an artificial neural associative memory 
in his seminal work in 1982 [1]. The so-called Hopfield 
network is inspired by Hebbian learning [12] and is composed 
of binary-valued ($\pm 1$) neurons, which together are able to 
memorize a certain number of patterns. In our terminology, 
the Hopfield network corresponds to a $(\{-1, 1\}, n, n)$ neural 
model. The pattern retrieval capacity of a Hopfield network of 
$n$ neurons was derived later by Amit et al. [13] and shown to 
be $0.13n$, under a vanishing bit error probability requirement. 
Later, McEliece et al. [9] proved that under the requirement 
of vanishing pattern error probability, the capacity of Hopfield 
networks is $n/(2\log(n)) = O(n/\log(n))$. 

In addition to neural networks with online learning capa- 
bility, offline methods have also been used to design neural 
associative memories. For instance, in Q the authors assume 
the complete set of patterns is given in advance and calculate 
the weight matrix using the pseudo-inverse rule [14] offline. 
In return, this approach helps them improve the capacity of a 
Hopfield network to n/2, under vanishing pattern error proba- 
bility condition, while being able to correct one bit of error in 
the recall phase. Although this is a significant improvement over 
the $n/\log(n)$ scaling of the pattern retrieval capacity in [9], it 
comes at the price of much higher computational complexity 
and the lack of gradual learning ability. 

While the connectivity graph of a Hopfield network is a 
complete graph, Komlos and Paturi [15] extended the work 
of McEliece to sparse neural graphs. Their results are of 
particular interest as physiological data is also in favor of 
sparsely interconnected neural networks. They considered 
a network in which each neuron is connected to $d$ 
other neurons, i.e., a $d$-regular network. Assuming that the 
network graph satisfies certain connectivity measures, they 
prove that it is possible to store a linear number of random 
patterns (in terms of $d$) with vanishing bit error probability 
or $C = O(d/\log n)$ random patterns with vanishing pattern 
error probability. Furthermore, they show that in spite of the 
capacity reduction, the error correction capability remains the 
same as the network can still tolerate a number of errors which 
is linear in $n$. 

¹ It must be mentioned that neural states below $0$ and above $Q - 1$ will 
be clipped to $0$ and $Q - 1$, respectively. This is biologically justified as the 
firing rate of neurons cannot exceed an upper bound and of course cannot 
be less than zero. 



It is also known that the capacity of neural associative 
memories could be enhanced if the patterns are of low-activity 
nature, in the sense that at any time instant many of the 
neurons are silent [14]. However, even these schemes fail 
when required to correct a fair amount of erroneous bits, as the 
information retrieval is not better compared to that of normal 
networks. 

Extension of associative memories to non-binary neural 
models has also been explored in the past. Hopfield addressed 
the case of continuous neurons and showed that similar to 
the binary case, neurons with states between —1 and 1 can 
memorize a set of random patterns, albeit with less capacity 
[16]. Prados and Kak considered a digital version of non-binary 
neural networks in which neural states could assume 
integer (positive and negative) values [17]. They show that 
the storage capacity of such networks is in general larger 
than that of their binary peers. However, the capacity would still be 
less than $n$ in the sense that the proposed neural network 
cannot have more than $n$ patterns that are stable states of the 
network, let alone being able to retrieve the correct pattern 
from corrupted input queries. 

In [3] the authors investigated a multi-state complex-valued 
neural associative memory for which the estimated capacity is 
C < 0.15n. Under the same model but using a different learn- 
ing method, Muezzinoglu et al. Q showed that the capacity 
can be increased to $C = n$. However, the complexity of the 
weight computation mechanism is prohibitive. To overcome 
this drawback, a Modified Gradient Descent learning Rule 
(MGDR) was devised in [18]. In our terminology, all these 
models are $(\{e^{2\pi j s/k} \mid 0 \le s \le k - 1\}, n, n)$ neural associative 
memories. 

Given that even very complex offline learning methods 
cannot improve the capacity of binary or multi-state neural associative 
memories, a group of recent works has made considerable 
efforts to exploit the inherent structure of the patterns in order 
to increase capacity and improve error correction capabilities. 
Such methods focus merely on memorizing those patterns that 
have some sort of inherent redundancy. As a result, they differ 
from previous methods in which the network was designed to 
be able to memorize any random set of patterns. Pioneering 
this approach, Berrou and Gripon [19] achieved considerable 
improvements in the pattern retrieval capacity of Hopfield 
networks by utilizing Walsh-Hadamard sequences. 
Walsh-Hadamard sequences are a particular type of low correlation 
sequences and were initially used in CDMA communications 
to overcome the effect of noise. The only slight downside 
to the proposed method is the use of a decoder based on 
the winner-take-all approach, which requires a separate neural 
stage, increasing the complexity of the overall method. Using 
low correlation sequences has also been considered in [5], 
where the authors introduced two novel mechanisms of neural 
association that employ binary neurons to memorize patterns 
belonging to another type of low correlation sequences, called 
the Gold family [20]. The network itself is very similar to that of 
Hopfield, with a slightly modified weighting rule. Therefore, 
similar to a Hopfield network, the complexity of the learning 
phase is small. However, the authors failed to increase the 
pattern retrieval capacity beyond $n$ and it was shown that the 
pattern retrieval capacity of the proposed model is $C = n$, 
while being able to correct a fair number of erroneous input 
bits. 

Later, Gripon and Berrou came up with a different approach 
based on neural cliques, which increased the pattern retrieval 
capacity to $O(n^2)$ [6]. Their method is based on dividing a 
neural network of size $n$ into $c$ clusters of size $n/c$ each. Then, 
the messages are chosen such that only one neuron in each 
cluster is active for a given message. Therefore, one can think 
of messages as a random vector of length $c \log(n/c)$, where 
the $\log(n/c)$ part specifies the index of the active neuron in a 
given cluster. The authors also provide a learning algorithm, 
similar to that of Hopfield, to learn the pair-wise correlations 
within the patterns. Using this technique and exploiting the fact 
that the resulting patterns are very sparse, they could boost 
the capacity to $O(n^2)$ while maintaining the computational 
simplicity of Hopfield networks. 

In contrast to the pairwise correlation of the Hopfield model, 
Peretto et al. [21] deployed higher order neural models: the 
models in which the state of the neurons not only depends 
on the state of their neighbors, but also on the correlation 
among them. Under this model, they showed that the storage 
capacity of a higher-order Hopfield network can be improved 
to C = 0(nP~^), where p is the degree of correlation 
considered. The main drawback of this model is the huge 
computational complexity required in the learning phase, as 
one has to keep track of 0{nP^^) neural links and their 
weights during the learning period. 

Recently, the present authors introduced a novel model 
inspired by modern coding techniques in which a neural 
bipartite graph is used to memorize the patterns that belong 
to a subspace [10]. The proposed model can be also thought 
of as a way to capture higher order correlations in given 
patterns while keeping the computational complexity to a 
minimal level (since instead of 0(n^~^) weights one needs 
to only keep track of 0{n^) of them). Under the assumptions 
that the bipartite graph is known, sparse, and an expander, the 
proposed algorithm increased the pattern retrieval capacity to 
$C = O(a^n)$, for some $a > 1$, closing the gap between the 
pattern retrieval capacities achieved in neural networks and 
that of coding techniques. For completeness, this approach is 
presented in the appendix (along with the detailed proofs). The 
main drawbacks of the proposed approach were the lack of a 
learning algorithm as well as the expansion assumption on the 
neural graph. 

In this paper, we focus on extending the results described 
in [10] in several directions: first, we suggest an iterative 
learning algorithm to find the neural connectivity matrix from 
the patterns in the training set. Secondly, we provide an 
analysis of the proposed error correcting algorithm in the recall 
phase and investigate its performance as a function of input 
noise and network model. Finally, we discuss some variants of 
the error correcting method which achieve better performance 
in practice. 



It is worth mentioning that an extension of this approach 
to a multi-level neural network is considered in [22]. There, 
the novel structure enables better error correction. However, 
the learning algorithm lacks the ability to learn the patterns 
one by one and requires the patterns to be presented all 
at the same time in the form of a big matrix. In [23] we 
have further extended this approach to a modular single-layer 
architecture with online learning capabilities. The modular 
structure makes the recall algorithm much more efficient while 
the online learning enables the network to learn gradually from 
examples. The learning algorithm proposed in this paper is also 
virtually the same as the one we proposed in [23], giving it 
the advantage of 

Another important point to note is that learning linear 
constraints by a neural network is hardly a new topic, as one 
can learn a matrix orthogonal to a set of patterns in the training 
set (i.e., $W x^\mu = 0$) using simple neural learning rules (we 
refer the interested readers to [24] and [25]). However, to 
the best of our knowledge, finding such a matrix subject to 
sparsity constraints has not been investigated before. This 
problem can also be regarded as an instance of compressed 
sensing [26], in which the measurement matrix is given by 
the big patterns matrix $\mathcal{X}_{C \times n}$ and the set of measurements 
are the constraints we look to satisfy, denoted by the tall 
vector $b$, which for simplicity we assume to be all 
zero. Thus, we are interested in finding a sparse vector $w$ 
such that $\mathcal{X} w = 0$. Nevertheless, many decoders proposed 
in this area are very complicated and cannot be implemented 
by a neural network using simple neuron operations. Some 
exceptions are [27] and [28], which are closely related to the 
learning algorithm proposed in this paper. 

D. Solution Overview 

Before going through the details of the algorithms, let us 
give an overview of the proposed solution. To learn the set of 
given patterns, we have adopted the neural learning algorithm 
proposed in [29] and modified it to favor sparse solutions. 
In each iteration of the algorithm, a random pattern from 
the data set is picked and the neural weights corresponding 
to constraint neurons are adjusted in such a way that the 
projection of the pattern along the current weight vectors is 
reduced, while trying to make the weights sparse as well. 

In the recall phase, we exploit the fact that the learned 
neural graph is sparse and orthogonal to the set of patterns. 
Therefore, when a query is given, if it is not orthogonal to the 
connectivity matrix of the weighted neural graph, it is noisy. 
We will use the sparsity of the neural graph to eliminate this 
noise using a simple iterative algorithm. In each iteration, there 
is a set of violated constraint neurons, i.e. those that receive 
a non-zero sum over their input links. These nodes will send 
feedback to their corresponding neighbors among the pattern 
neurons, where the feedback is the sign of the received input- 
sum. At this point, the pattern nodes that receive feedback from 
a majority of their neighbors update their state according to the 
sign of the sum of received messages. This process continues 
until noise is eliminated completely or a failure is declared. 
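
The recall procedure outlined above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the exact algorithm analyzed later in the paper: the function names, the iteration limit, the majority threshold of one half, the unit step size, and the rule that feedback is multiplied by the sign of the link weight it travels over are all choices made for this illustration. Clipping of the states to $\{0, \dots, Q-1\}$ is omitted for brevity.

```python
def sign(v, tol=1e-9):
    """Return -1, 0 or +1; values within tol of zero count as zero."""
    return (v > tol) - (v < -tol)


def violations(W, x):
    """Feedback of each constraint neuron: sign of its weighted input sum."""
    return [sign(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]


def recall(W, x, max_iter=20, majority=0.5):
    """Iteratively remove integer noise from x; returns (state, success)."""
    x = list(x)
    m, n = len(W), len(x)
    for _ in range(max_iter):
        s = violations(W, x)
        if not any(s):
            return x, True  # all constraints satisfied
        updated = False
        for j in range(n):
            neighbors = [i for i in range(m) if W[i][j] != 0]
            violated = [i for i in neighbors if s[i] != 0]
            # Update only if a majority of the neighbors sent feedback.
            if len(violated) <= majority * len(neighbors):
                continue
            # Assumption: feedback travels back over the same link, so it is
            # multiplied by the sign of the link weight before being summed.
            vote = sign(sum(s[i] * sign(W[i][j]) for i in violated))
            if vote != 0:
                x[j] -= vote
                updated = True
        if not updated:
            return x, False  # stuck: declare failure
    return x, not any(violations(W, x))


# Hypothetical toy graph: each pattern neuron has two constraint neighbors and
# any two pattern neurons share exactly one of them; patterns are (c, c, c).
W_RING = [[1, -1, 0],
          [0, 1, -1],
          [-1, 0, 1]]
recovered, ok = recall(W_RING, [3, 2, 2])  # pattern (2, 2, 2) with +1 noise
```

In this toy graph a single erroneous pattern neuron sees both of its constraints violated, while every other pattern neuron sees only one of two, so only the corrupted neuron updates its state.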



In short, we propose a neural network with online learning 
capabilities which uses only neural operations to memorize an 
exponential number of patterns. 

III. Learning Phase 

Since the patterns are assumed to be coming from a subspace 
in the $n$-dimensional space, we adapt the algorithm 
proposed by Oja and Karhunen [29] to learn the null-space 
basis of the subspace defined by the patterns. In fact, a very 
similar algorithm is also used in [24] for the same purpose. 
However, since we need the basis vectors to be sparse (due 
to requirements of the algorithm used in the recall phase), we 
add an additional term to penalize non-sparse solutions during 
the learning phase. 

Another difference between the proposed method and that of 
[24] is that the learning algorithm proposed in [24] yields 
dual vectors that form an orthogonal set. Although one can 
easily extend our suggested method to such a case as well, we 
find this requirement unnecessary in our case. This gives us 
the additional advantage of making the algorithm parallel and 
adaptive: parallel in the sense that we can design an algorithm 
to learn one constraint and repeat it several times in order to 
find all constraints with high probability, and adaptive in the 
sense that we can determine the number of constraints 
on-the-go, i.e. start by learning just a few constraints. If needed 
(for instance due to bad performance in the recall phase), the 
network can easily learn additional constraints. This increases 
the flexibility of the algorithm and provides a nice trade-off 
between the time spent on learning and the performance in the 
recall phase. Both these points make the approach more biologically 
realistic. 

It should be mentioned that the core of our learning algorithm 
here is virtually the same as the one we proposed in 
[23]. 



A. Overview of the proposed algorithm 

The problem of finding one sparse constraint vector $w$ is given 
by equations (1a) and (1b), in which pattern $\mu$ is denoted by 
$x^\mu$: 

$$\min \sum_{\mu=1}^{C} |x^\mu \cdot w|^2 + \eta g(w), \qquad \text{(1a)}$$

subject to: 

$$\|w\|_2 = 1. \qquad \text{(1b)}$$

In the above problem, $\cdot$ is the inner product, $\|\cdot\|_2$ represents the 
$\ell_2$ vector norm, $g(w)$ is a penalty function to encourage sparsity 
and $\eta$ is a positive constant. There are various ways to choose 
$g(w)$. For instance, one can pick $g(w)$ to be $\|\cdot\|_1$, which leads 
to an $\ell_1$-norm penalty and is widely used in compressed sensing 
applications [27], [28]. Here, we will use a different penalty 
function, as explained later. 

To form the basis for the null space of the patterns, we need 
$m = n - k$ vectors, which we can obtain by solving the above 
problem several times, each time from a random initial point.² 

As for the sparsity penalty term $g(w)$ in this problem, in 
this paper we consider the function 

$$g(w) = \sum_{i=1}^{n} \tanh(\sigma w_i^2),$$

where $\sigma$ is chosen appropriately. Intuitively, $\tanh(\sigma w_i^2)$ 
approximates $|\mathrm{sign}(w_i)|$ in the $\ell_0$-norm. Therefore, the larger $\sigma$ is, 
the closer $g(w)$ will be to $\|\cdot\|_0$. By calculating the derivative 
of the objective function, and by considering the update due 
to each randomly picked pattern $x$, we get the following 
iterative algorithm: 



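$$y(t) = x(t) \cdot w(t) \qquad \text{(2a)}$$

$$\tilde{w}(t+1) = w(t) - \alpha_t \left( 2 y(t) x(t) + \eta \Gamma(w(t)) \right) \qquad \text{(2b)}$$

$$w(t+1) = \frac{\tilde{w}(t+1)}{\|\tilde{w}(t+1)\|_2} \qquad \text{(2c)}$$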



In the above equations, $t$ is the iteration number, $x(t)$ is the 
sample pattern chosen at iteration $t$ uniformly at random from 
the patterns in the training set $\mathcal{X}$, and $\alpha_t$ is a small positive 
constant. Finally, $\Gamma(w) : \mathbb{R}^n \to \mathbb{R}^n = \nabla g(w)$ is the gradient 
of the penalty term for non-sparse solutions. This function has 
the interesting property that for very small values of $w_i(t)$, 
$\Gamma_i(w(t)) \simeq 2\sigma w_i(t)$. To see why, consider the $i$-th entry of the 
function $\Gamma(w(t))$: 

$$\Gamma_i(w(t)) = \partial g(w(t)) / \partial w_i(t) = 2\sigma w_i(t) \left( 1 - \tanh^2(\sigma w_i(t)^2) \right)$$

It is easy to see that $\Gamma_i(w(t)) \simeq 2\sigma w_i(t)$ for relatively small 
$w_i(t)$'s. And for larger values of $w_i(t)$, we get $\Gamma_i(w(t)) \simeq 0$ 
(see Figure 2). Therefore, by proper choice of $\eta$ and $\sigma$, 
equation (2b) suppresses small entries of $w(t)$ by pushing them 
towards zero, thus favoring sparser results. To simplify the 
analysis, with some abuse of notation, we approximate the 
function $\Gamma(w(t))$ with the following function: 

$$\Gamma_i(w(t)) = \begin{cases} w_i(t) & \text{if } |w_i(t)| \le \theta_t, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{(3)}$$

where $\theta_t$ is a small positive threshold. 
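
The two regimes of the penalty gradient can be checked numerically. The Python sketch below implements $g(w)$, the exact entry-wise gradient, and the thresholded approximation of equation (3); the function names and the numerical values of $\sigma$ and $\theta$ are assumptions chosen only for this illustration.

```python
import math


def g(w, sigma):
    """Sparsity penalty: sum of tanh(sigma * w_i^2) over the entries of w."""
    return sum(math.tanh(sigma * wi ** 2) for wi in w)


def gamma_exact(wi, sigma):
    """Exact derivative of tanh(sigma * w_i^2) with respect to w_i."""
    return 2.0 * sigma * wi * (1.0 - math.tanh(sigma * wi ** 2) ** 2)


def gamma_threshold(wi, theta):
    """Thresholded approximation (3): keep small entries, drop large ones."""
    return wi if abs(wi) <= theta else 0.0


sigma = 5.0
small = gamma_exact(1e-3, sigma)  # close to 2 * sigma * w_i
large = gamma_exact(3.0, sigma)   # close to 0

# Central finite difference of g as a sanity check of the derivative.
h = 1e-6
numeric = (g([0.3 + h], 2.0) - g([0.3 - h], 2.0)) / (2 * h)
```

For small entries the exact gradient behaves like the linear term $2\sigma w_i$, and for large entries it vanishes, which is the behavior that the thresholded function mimics up to the omitted constant $2\sigma$.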

Following the same approach as [29] and assuming $\alpha_t$ to 
be small enough such that equation (2c) can be expanded 
as powers of $\alpha_t$, we can approximate equation (2) with the 
following simpler version: 

$$y(t) = x(t) \cdot w(t) \qquad \text{(4a)}$$

$$w(t+1) = w(t) - \alpha_t \left( y(t) \left( x(t) - \frac{y(t) w(t)}{\|w(t)\|_2^2} \right) + \eta \Gamma(w(t)) \right) \qquad \text{(4b)}$$

In the above approximation, we also omitted the term 
$\alpha_t \eta \left( w(t) \cdot \Gamma(w(t)) \right) w(t)$ since $w(t) \cdot \Gamma(w(t))$ would be 
negligible, especially as $\theta_t$ in equation (3) becomes smaller. 

² It must be mentioned that in order to have exactly $m = n - k$ linearly 
independent vectors, we should pay some additional attention when repeating 
the proposed method several times. This issue is addressed later in the paper. 




Fig. 2. The sparsity penalty $\Gamma_i(w_i)$, which suppresses small values of the 
$i$-th entry of $w$ in each iteration, as a function of $w_i$ and $\sigma$. Note that the 
normalization constant $2\sigma$ has been omitted here to make comparison with 
the function $f = w_i$ possible. 



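Algorithm 1 Iterative Learning 

Input: Set of patterns $x^\mu \in \mathcal{X}$ with $\mu = 1, \dots, C$, stopping 
point $\varepsilon$. 
Output: $w$ 

while $\sum_\mu |x^\mu \cdot w(t)|^2 > \varepsilon$ do 

    Choose $x(t)$ at random from patterns in $\mathcal{X}$ 
    Compute $y(t) = x(t) \cdot w(t)$ 
    Update $w(t+1) = w(t) - \alpha_t y(t) \left( x(t) - \frac{y(t) w(t)}{\|w(t)\|_2^2} \right) - \alpha_t \eta \Gamma(w(t))$ 

end while 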



The overall learning algorithm for one constraint node is 
given by Algorithm 1. In words, in Algorithm 1, $y(t)$ is the 
projection of $x(t)$ on the basis vector $w(t)$. If for a given 
data vector $x(t)$, $y(t)$ is equal to zero, namely, the data is 
orthogonal to the current weight vector $w(t)$, then according to 
equation (4b) the weight vector will not be updated. However, 
if the data vector $x(t)$ has some projection over $w(t)$ then the 
weight vector is updated towards the direction that reduces this 
projection. 
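
A minimal Python sketch of Algorithm 1 for a single constraint vector is given below. It uses a constant learning rate and the thresholded penalty of equation (3); the function names, the default parameter values, the iteration cap and the toy patterns are assumptions made for this illustration and are not the settings used in the simulations.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cost(patterns, w):
    """Sum of squared projections of the patterns on w."""
    return sum(dot(x, w) ** 2 for x in patterns)


def learn_constraint(patterns, w0, alpha=0.02, eta=0.01, theta=0.05,
                     eps=1e-12, max_steps=20000, seed=0):
    """Learn one constraint vector with the update rule (4b)."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(max_steps):
        if cost(patterns, w) <= eps:
            break
        x = rng.choice(patterns)   # pattern picked uniformly at random
        y = dot(x, w)              # projection of x on w
        norm_sq = dot(w, w)
        # Thresholded sparsity term, as in equation (3).
        gamma = [wi if abs(wi) <= theta else 0.0 for wi in w]
        w = [wi - alpha * (y * (xi - y * wi / norm_sq) + eta * gi)
             for wi, xi, gi in zip(w, x, gamma)]
    return w


# Hypothetical toy training set: patterns (a, c, a + c, a) span a
# two-dimensional subspace of the four-dimensional pattern space.
PATTERNS = [[a, c, a + c, a] for a in range(3) for c in range(3)]
W0 = [0.5, 0.5, 0.5, 0.5]
```

For example, `learn_constraint(PATTERNS, W0, eta=0.0)` runs the rule without the sparsity term, and the resulting vector can be checked with `cost(PATTERNS, w)`. The learning rate must be small relative to the squared norm of the patterns for the update to remain stable.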

Since we are interested in finding $m$ basis vectors, we have 
to do the above procedure at least $m$ times in parallel.³ 

Remark 1. Although we are interested in finding a sparse 
graph, note that too much sparseness is not desired. This is 
because we are going to use the feedback sent by the constraint 

³ In practice, we may have to repeat this process more than $m$ times to 
ensure the existence of a set of $m$ linearly independent vectors. However, our 
experimental results suggest that most of the time, repeating $m$ times would 
be sufficient. 



nodes to eliminate input noise at pattern nodes during the 
recall phase. Now if the graph is too sparse, the number 
of feedback messages received by each pattern node is too 
small to be relied upon. Therefore, we must adjust the penalty 
coefficient $\eta$ such that the resulting neural graph is sufficiently 
sparse. In the section on experimental results, we compare 
error correction performance for different choices of $\eta$. 

B. Convergence analysis 

In order to prove that Algorithm 1 converges to the proper 
solution, we use results from statistical learning. More 
specifically, we benefit from the convergence of Stochastic Gradient 
Descent (SGD) algorithms [30]. To prove the convergence, let 
$E(w) = \sum_\mu |x^\mu \cdot w|^2$ be the cost function we would 
like to minimize. Furthermore, let $A = \mathbb{E}\{x x^T | x \in \mathcal{X}\}$ be 
the correlation matrix for the patterns in the training set. 
Therefore, due to the uniformity assumption for the patterns in the 
training set, one can rewrite $E(w) = w^T A w$ (up to the constant 
factor $C$). Finally, denote $A_\mu = x^\mu (x^\mu)^T$. Now consider the 
following assumptions: 

A1. $\|A\|_2 \le \Upsilon < \infty$ and $\sup_\mu \|A_\mu\|_2 = \sup_\mu \|x^\mu\|_2^2 \le \zeta < \infty$. 
A2. $\alpha_t > 0$, $\sum_t \alpha_t \to \infty$ and $\sum_t \alpha_t^2 < \infty$, where $\alpha_t$ is the 
small learning rate defined in (2). 

The following lemma proves the convergence of Algorithm 
1 to a local minimum $w^*$. 

Lemma 1. Let assumptions A1 and A2 hold. Then, Algorithm 
1 converges to a local minimum $w^*$ for which $\nabla E(w^*) = 0$. 

Proof: To prove the lemma, we use the convergence 
results in [30] and show that the assumptions required to 
ensure convergence hold for the proposed algorithm. For 
simplicity, these assumptions are listed here: 

1) The cost function $E(w)$ is three-times differentiable with 
continuous derivatives. It is also bounded from below. 

2) The usual conditions on the learning rates are fulfilled, 
i.e. $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. 

3) The second moment of the update term should not grow 
more than linearly with the size of the weight vector. In 
other words, 

$$\mathbb{E}\left( \|2 y(t) x(t) + \eta \Gamma(w(t))\|_2^2 \right) \le a + b \|w\|_2^2$$

for some constants $a$ and $b$. 

4) When the norm of the weight vector $w$ is larger 
than a certain horizon $D$, the opposite of the gradient 
$-\nabla E(w)$ points towards the origin. Or in other words: 

$$\inf_{\|w\|_2 > D} w \cdot \nabla E(w) > 0$$

5) When the norm of the weight vector is smaller than 
a second horizon $F$, with $F > D$, then the norm 
of the update term $(2 y(t) x(t) + \eta \Gamma(w(t)))$ is bounded 
regardless of $x(t)$. This is usually a mild requirement: 

$$\forall x(t) \in \mathcal{X}, \quad \sup_{\|w\|_2 \le F} \|2 y(t) x(t) + \eta \Gamma(w(t))\|_2 \le K_0$$

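To start, assumption 1 holds trivially as the cost function
is three-times differentiable, with continuous derivatives.
Furthermore, $E(w) \ge 0$. Assumption 2 holds because of
our choice of the step size $\alpha_t$, as mentioned in the lemma
description.

Assumption 3 ensures that the vector $w$ could not escape by
becoming larger and larger. Due to the constraint $\|w\|_2 = 1$,
this assumption holds as well.

Assumption 4 holds as well because:

$$\begin{aligned}
\mathbb{E}_x\left(\|2A_\mu w + \eta\Gamma(w)\|_2^2\right) &= 4 w^T \mathbb{E}_x(A_\mu^2) w + \eta^2 \|\Gamma(w)\|_2^2 + 4\eta\, w^T \mathbb{E}_x(A_\mu) \Gamma(w) \\
&\le 4\zeta^2 \|w\|_2^2 + \eta^2 \|w\|_2^2 + 4\eta\Upsilon \|w\|_2^2 \\
&= \|w\|_2^2 \left(4\zeta^2 + 4\eta\Upsilon + \eta^2\right)
\end{aligned} \tag{5}$$

Finally, assumption 5 holds because:

$$\begin{aligned}
\|2A_\mu w + \eta\Gamma(w)\|_2^2 &= 4 w^T A_\mu^2 w + \eta^2 \|\Gamma(w)\|_2^2 + 4\eta\, w^T A_\mu \Gamma(w) \\
&\le \|w\|_2^2 \left(4\zeta^2 + 4\eta\zeta + \eta^2\right)
\end{aligned} \tag{6}$$

Therefore, $\exists F > D$ such that as long as $\|w\|_2^2 < F$:

$$\sup_{\|w\|_2^2 < F} \|2A_\mu w + \eta\Gamma(w)\|_2^2 \le (2\zeta + \eta)^2 F = \text{constant} \tag{7}$$

Since all necessary assumptions hold for the learning Algorithm 1,
it converges to a local minimum where $\nabla E(w^*) = 0$.

■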

Next, we prove the desired result, i.e. the fact that at the 
local minimum, the resulting weight vector is orthogonal to 
the patterns, i.e. Aw = 0. 

Theorem 2. In the local minimum where $\nabla E(w^*) = 0$, the
optimal vector $w^*$ is orthogonal to the patterns in the training
set.

Proof: Since $\nabla E(w^*) = 2Aw^* + \eta\Gamma(w^*) = 0$, we have:

$$w^* \cdot \nabla E(w^*) = 2 (w^*)^T A w^* + \eta\, w^* \cdot \Gamma(w^*) \tag{8}$$

The first term is always greater than or equal to zero. Now
as for the second term, we have that $|\Gamma(w_i)| \le |w_i|$ and
$\operatorname{sign}(w_i) = \operatorname{sign}(\Gamma(w_i))$, where $w_i$ is the $i$-th entry of $w$.
Therefore, $0 \le w^* \cdot \Gamma(w^*) \le \|w^*\|_2^2$. Therefore, both terms
on the right hand side of (8) are greater than or equal to zero.
And since the left hand side is known to be equal to zero, we
conclude that $(w^*)^T A w^* = 0$ and $\Gamma(w^*) = 0$. The former
means $(w^*)^T A w^* = \sum_{\mu} (w^* \cdot x^\mu)^2 = 0$. Therefore, we must
have $w^* \cdot x^\mu = 0$, for all $\mu = 1, \dots, C$. This simply means that
the vector $w^*$ is orthogonal to all the patterns in the training
set. ■

Remark 2. Note that the above theorem only proves that
the obtained vector is orthogonal to the data set and says
nothing about its degree of sparsity. The reason is that there
is no guarantee that the dual basis of a subspace be sparse.
The introduction of the penalty function $g(w)$ in problem ([7])
only encourages sparsity by suppressing the small entries of
$w$, i.e. shifting them towards zero if they are really small or
leaving them intact if they are rather large. And from the
fact that $\Gamma(w^*) = 0$, we know this is true as the entries in
$w^*$ are either large or zero, i.e. there are no small entries.
Our experimental results in Section VII show that in fact this
strategy works perfectly and the learning algorithm results in
sparse solutions.



C. Avoiding the all-zero solution 

Although in the optimization problem we have the constraint $\|w\|_2 = 1$
to make sure that the algorithm does not converge to the
trivial solution $w = 0$, due to approximations we made when
developing the optimization algorithm, we should make sure
to choose the parameters such that the all-zero solution is still
avoided.

To this end, denote $w'(t) = w(t) - \alpha_t y(t)\left(x(t) - \frac{y(t)w(t)}{\|w(t)\|_2^2}\right)$
and consider the following inequalities:

$$\begin{aligned}
\|w(t+1)\|_2^2 &= \left\| w(t) - \alpha_t y(t)\left(x(t) - \frac{y(t)w(t)}{\|w(t)\|_2^2}\right) - \alpha_t \eta \Gamma(w(t)) \right\|_2^2 \\
&= \|w'(t)\|_2^2 + \alpha_t^2 \eta^2 \|\Gamma(w(t))\|_2^2 - 2\alpha_t \eta\, \Gamma(w(t)) \cdot w'(t) \\
&\ge \|w'(t)\|_2^2 - 2\alpha_t \eta\, \Gamma(w(t)) \cdot w'(t)
\end{aligned} \tag{9}$$

Now in order to have $\|w(t+1)\|_2^2 > 0$, we must have
that $2\alpha_t \eta\, |\Gamma(w(t))^T w'(t)| < \|w'(t)\|_2^2$. Given that
$|\Gamma(w(t)) \cdot w'(t)| \le \|w'(t)\|_2 \|\Gamma(w(t))\|_2$, it is therefore sufficient to have
$2\alpha_t \eta \|\Gamma(w(t))\|_2 < \|w'(t)\|_2$. On the other hand, we have:

$$\|w'(t)\|_2^2 = \|w(t)\|_2^2 + \alpha_t^2 y(t)^2 \left\| x(t) - \frac{y(t)w(t)}{\|w(t)\|_2^2} \right\|_2^2 \ge \|w(t)\|_2^2 \tag{10}$$

As a result, in order to have $\|w(t+1)\|_2^2 > 0$, it is sufficient
to have $2\alpha_t \eta \|\Gamma(w(t))\|_2 < \|w(t)\|_2$. Finally, since we have
$|\Gamma(w(t))| \le |w(t)|$ (entry-wise), we know that $\|\Gamma(w(t))\|_2 \le \|w(t)\|_2$.
Therefore, having $2\alpha_t \eta < 1 \le \|w(t)\|_2 / \|\Gamma(w(t))\|_2$
ensures $\|w(t)\|_2 > 0$.
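The norm argument above can be checked numerically on a single update. The following is a minimal sketch, not the authors' implementation: it assumes that $y(t) = w(t) \cdot x(t)$ and that the sparsity term $\Gamma$ returns small entries (magnitude at most a threshold `theta`) unchanged and zero otherwise, which is one choice satisfying $|\Gamma(w_i)| \le |w_i|$ with matching signs. The function names and the parameter `theta` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gamma(w, theta):
    """Assumed sparsity term: keeps entries with |w_i| <= theta, zeroes the rest."""
    return [wi if abs(wi) <= theta else 0.0 for wi in w]


def learning_step(w, x, alpha, eta, theta):
    """One update w(t) -> w(t+1); returns (w_prime, w_next) as in (9)-(10)."""
    y = dot(w, x)          # assumed projection y(t) = w(t) . x(t)
    norm2 = dot(w, w)      # ||w(t)||_2^2, must be non-zero
    # w'(t) = w - alpha * y * (x - y * w / ||w||^2)
    w_prime = [wi - alpha * y * (xi - y * wi / norm2) for wi, xi in zip(w, x)]
    g = gamma(w, theta)
    # w(t+1) = w'(t) - alpha * eta * Gamma(w(t))
    w_next = [wp - alpha * eta * gi for wp, gi in zip(w_prime, g)]
    return w_prime, w_next


if __name__ == "__main__":
    w = [0.4, -0.1, 0.05, 0.7]
    x = [3.0, 1.0, 0.0, 2.0]
    w_prime, w_next = learning_step(w, x, alpha=0.05, eta=0.5, theta=0.2)
    print(dot(w, w), dot(w_prime, w_prime), dot(w_next, w_next))
```

With $2\alpha_t\eta < 1$ the printed norms should be non-decreasing from $w(t)$ to $w'(t)$, and the norm of $w(t+1)$ should stay strictly positive.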

Remark 3. Interestingly, the above choice for the function
$w - \eta\Gamma(w)$ looks very similar to the soft thresholding function (11)
introduced in f^Tj to perform iterative compressed sensing.
The authors show that their choice of the sparsity function is
very competitive in the sense that one can not get much better
results by choosing other thresholding functions. However, one
main difference between their work and ours is that
we enforce the sparsity as a penalty in equation (2b) while
they apply the soft thresholding function in equation (11) to
the whole $w$, i.e. if the updated value of $w$ is larger than a
threshold, it is left intact while it will be put to zero otherwise.

$$x \mapsto \begin{cases} x - \theta_t & \text{if } x > \theta_t \\ x + \theta_t & \text{if } x < -\theta_t \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $\theta_t$ is the threshold at iteration $t$ and tends to zero as $t$
grows.
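To make the comparison in the remark concrete, the sketch below contrasts the two operations on a small vector. It uses the standard soft-thresholding operator from the compressed sensing literature, and assumes (this is not stated in this section) that $\Gamma$ returns entries with magnitude at most `theta` unchanged and zero otherwise. Both function names are hypothetical.

```python
def soft_threshold(x, theta):
    """Standard soft thresholding: shrink x towards zero by theta."""
    if x > theta:
        return x - theta
    if x < -theta:
        return x + theta
    return 0.0


def penalty_shrink(w, eta, theta):
    """Penalty-style step w - eta * Gamma(w) under the assumed Gamma:
    only entries with |w_i| <= theta are scaled by (1 - eta)."""
    return [wi - eta * wi if abs(wi) <= theta else wi for wi in w]


if __name__ == "__main__":
    w = [0.1, -0.3, 2.0, -1.5]
    print([soft_threshold(wi, 0.5) for wi in w])   # acts on every entry
    print(penalty_shrink(w, eta=0.5, theta=0.5))   # acts on small entries only
```

The penalty-style step leaves large entries untouched and only nudges small ones towards zero, whereas soft thresholding modifies every entry.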



D. Making the Algorithm Parallel 

In order to find $m$ constraints, we need to repeat Algorithm
1 several times. Fortunately, we can repeat this process in
parallel, which speeds up the algorithm and is more mean- 
ingful from a biological point of view as each constraint 
neuron can act independently of other neighbors. Although 
doing the algorithm in parallel may result in linearly dependent 
constraints once in a while, our experimental results show that 
starting from different random initial points, the algorithm 
converges to different distinct constraints most of the time. 
And the chance of getting redundant constraints reduces if we 
start from a sparse random initial point. Besides, as long as 
we have enough distinct constraints, the recall algorithm in the 
next section can start eliminating noise and there is no need 
to learn all the distinct basis vectors of the null space defined 
by the training patterns (albeit the performance improves as 
we learn more and more linearly independent constraints). 
Therefore, we will use the parallel version to have a faster 
algorithm in the end. 

IV. Recall Phase 

In the recall phase, we are going to design an iterative 
algorithm that corresponds to message passing on a graph. 
The algorithm exploits the fact that our learning algorithm 
resulted in the connectivity matrix of the neural graph which 
is sparse and orthogonal to the memorized patterns. There- 
fore, given a noisy version of the learned patterns, we can 
use the feedback from the constraint neurons in Fig. 1 to
eliminate noise. More specifically, the linear input sums to
the constraint neurons are given by the elements of the vector
$W(x^\mu + z) = Wx^\mu + Wz = Wz$, with $z$ being the integer-valued
input noise (biologically speaking, the noise can be
interpreted as a neuron skipping some spikes or firing more 
spikes than it should). Based on observing the elements of 
Wz, each constraint neuron feeds back a message (containing 
info about z) to its neighboring pattern neurons. Based on this 
feedback, and exploiting the fact that W is sparse, the pattern 
neurons update their states in order to reduce the noise z. 

It must also be mentioned that we initially assume asymmetric
neural weights during the recall phase. More specifically,
we assume the backward weight from constraint neuron $i$ to
pattern neuron $j$, denoted by $W^b_{ij}$, to be equal to the sign of
the weight from pattern neuron $j$ to constraint neuron $i$, i.e.
$W^b_{ij} = \operatorname{sign}(W_{ij})$, where $\operatorname{sign}(x)$ is equal to $+1$, $0$ or $-1$
if $x > 0$, $x = 0$ or $x < 0$, respectively. This assumption
simplifies the error correction analysis. Later in Section IV-B
we are going to consider another version of the algorithm
which works with symmetric weights, i.e. $W^b_{ij} = W_{ij}$, and
compare the performance of all suggested algorithms together
in Section VII.



A. The Recall Algorithms 

The proposed algorithm for the recall phase comprises
a series of forward and backward iterations. Two different
methods are suggested in this paper, which slightly differ from
each other in the way pattern neurons are updated. The first
one is based on the Winner-Take-All approach (WTA) and
is given by Algorithm 2. In this version, only the pattern
node that receives the highest amount of normalized feedback
updates its state while the other pattern neurons maintain
their current states. The normalization is done with respect
to the degree of each pattern neuron, i.e. the number of edges
connected to each pattern neuron in the neural graph. The
winner-take-all circuitry can be easily added to the neural
model shown in Figure 1 using any of the classic WTA
methods [14].

Algorithm 2 Recall Algorithm: Winner-Take-All
Input: Connectivity matrix $W$, iteration $t_{\max}$
Output: $x_1, x_2, \dots, x_n$
1: for $t = 1 \to t_{\max}$ do
2:   Forward iteration: Calculate the weighted input sum
     $h_i = \sum_{j=1}^{n} W_{ij} x_j$, for each constraint neuron $y_i$ and set:
     $y_i = \begin{cases} 1, & h_i < 0 \\ 0, & h_i = 0 \\ -1, & \text{otherwise} \end{cases}$
3:   Backward iteration: Each neuron $x_j$ with degree $d_j$ computes
     $g_j^{(1)} = \frac{\sum_{i=1}^{m} W^b_{ij} y_i}{d_j}, \quad g_j^{(2)} = \frac{\sum_{i=1}^{m} |W^b_{ij} y_i|}{d_j}$
4:   Find $j^* = \arg\max_j g_j^{(2)}$.
5:   Update the state of winner $j^*$: set $x_{j^*} = x_{j^*} + \operatorname{sign}(g_{j^*}^{(1)})$.
6:   $t \leftarrow t + 1$
7: end for

The second approach, given by Algorithm 3, is much
simpler: in every iteration, each pattern neuron decides locally
whether or not to update its current state. More specifically, if
the amount of feedback received by a pattern neuron exceeds
a threshold, the neuron updates its state; otherwise, it remains
unchanged.* In both algorithms, the quantity $g_j^{(2)}$ can be
interpreted as the number of feedback messages received by
pattern neuron $x_j$ from the constraint neurons. On the other
hand, the sign of $g_j^{(1)}$ provides an indication of the sign of the
noise that affects $x_j$, and $|g_j^{(1)}|$ indicates the confidence level
in the decision regarding the sign of the noise.

*Note that in order to maintain the current value of a neuron in case no
input feedback is received, we can add self-loops to pattern neurons in Figure
1. These self-loops are not shown in the figure for clarity.

It is worthwhile mentioning that the Majority-Voting decoding
algorithm is very similar to the Bit-Flipping algorithm of
Sipser and Spielman to decode LDPC codes [31] and a similar
approach in [32] for compressive sensing methods.

Algorithm 3 Recall Algorithm: Majority-Voting
Input: Connectivity matrix $W$, threshold $\varphi$, iteration $t_{\max}$
Output: $x_1, x_2, \dots, x_n$
1: for $t = 1 \to t_{\max}$ do
2:   Forward iteration: Calculate the weighted input sum
     $h_i = \sum_{j=1}^{n} W_{ij} x_j$, for each neuron $y_i$ and set:
     $y_i = \begin{cases} 1, & h_i < 0 \\ 0, & h_i = 0 \\ -1, & \text{otherwise} \end{cases}$
3:   Backward iteration: Each neuron $x_j$ with degree $d_j$ computes
     $g_j^{(1)} = \frac{\sum_{i=1}^{m} W^b_{ij} y_i}{d_j}, \quad g_j^{(2)} = \frac{\sum_{i=1}^{m} |W^b_{ij} y_i|}{d_j}$
4:   Update the state of each pattern neuron $j$ according to
     $x_j = x_j + \operatorname{sign}(g_j^{(1)})$ only if $|g_j^{(2)}| \ge \varphi$.
5:   $t \leftarrow t + 1$
6: end for

Remark 4. To give the reader some insight about why
the neural graph should be sparse in order for the above
algorithms to work, consider the backward iteration of both
algorithms: it is based on counting the fraction of received
input feedback messages from the neighbors of a pattern
neuron. In the extreme case, if the neural graph is complete,
then a single noisy pattern neuron results in the violation of all
constraint neurons in the forward iteration. As a result, in the
backward iteration all the pattern neurons receive feedback
from their neighbors and it is impossible to tell which of the
pattern neurons is the noisy one.

However, if the graph is sparse, a single noisy pattern
neuron only makes some of the constraints unsatisfied.
Consequently, in the recall phase only the nodes which share the
neighborhood of the noisy node receive input feedback. And
the fraction of the received feedback would be much larger
for the original noisy node. Therefore, by merely looking at the
fraction of received feedback from the constraint neurons, one
can identify the noisy pattern neuron with high probability as
long as the graph is sparse and the input noise is reasonably
bounded.
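The forward and backward iterations described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: the backward weights are taken as $\operatorname{sign}(W_{ij})$, the Majority-Voting update fires when the fraction of received feedback is at least `phi`, and the early exit in the Winner-Take-All loop (when no feedback is received) is an addition made here for convenience. The small matrix `W` and the pattern are hypothetical toy data chosen so that `W` is sparse and orthogonal to the pattern.

```python
def sign(v):
    """Returns +1, 0 or -1."""
    return (v > 0) - (v < 0)


def feedback(W, x):
    """One forward/backward pass; returns per-neuron lists (g1, g2)."""
    m, n = len(W), len(x)
    h = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]
    y = [-sign(hi) for hi in h]          # y_i = 1 if h_i < 0, 0 if 0, else -1
    g1, g2 = [], []
    for j in range(n):
        d_j = sum(1 for i in range(m) if W[i][j] != 0)   # degree of x_j
        msgs = [sign(W[i][j]) * y[i] for i in range(m)]  # backward messages
        g1.append(sum(msgs) / d_j if d_j else 0.0)
        g2.append(sum(abs(v) for v in msgs) / d_j if d_j else 0.0)
    return g1, g2


def recall_majority_voting(W, x, phi, t_max):
    """Every neuron whose feedback fraction reaches phi updates by +-1."""
    x = list(x)
    for _ in range(t_max):
        g1, g2 = feedback(W, x)
        x = [xj + sign(g1[j]) if g2[j] >= phi else xj for j, xj in enumerate(x)]
    return x


def recall_winner_take_all(W, x, t_max):
    """Only the neuron with the largest feedback fraction updates."""
    x = list(x)
    for _ in range(t_max):
        g1, g2 = feedback(W, x)
        j_star = max(range(len(x)), key=lambda j: g2[j])
        if g2[j_star] == 0:              # no violated constraint: stop early
            break
        x[j_star] += sign(g1[j_star])
    return x


# Toy example: each row of W is orthogonal to the pattern.
W = [
    [1, -1, 0, 0, 0, 0],
    [0, 1, -1, 0, 0, 0],
    [1, 0, -1, 0, 0, 0],
    [0, 0, 0, 1, -1, 0],
    [0, 0, 0, 0, 1, -1],
    [0, 0, 0, 1, 0, -1],
]
pattern = [2, 2, 2, 5, 5, 5]
noisy = [3, 2, 2, 5, 5, 5]               # +1 noise on the first neuron

if __name__ == "__main__":
    print(recall_majority_voting(W, noisy, phi=0.8, t_max=5))
    print(recall_winner_take_all(W, noisy, t_max=5))
```

In this toy graph the noisy neuron receives feedback from all of its neighbors, while its two neighbors in the graph receive feedback from only half of theirs, which is the separation Remark 4 relies on.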

B. Some Practical Modifications 

Although Algorithm 3 is fairly simple and practical, each
pattern neuron still needs two types of information: the number
of received feedbacks and the net input sum. Although one can
think of simple neural architectures to obtain the necessary
information, we can modify the recall algorithm to make it
more practical and simpler. The trick is to replace the degree
of each node $x_j$ with the $\ell_1$-norm of the outgoing weights.
In other words, instead of using $\|w_j\|_0 = d_j$, we use $\|w_j\|_1$,
where $w_j$ denotes the vector of weights on the edges of $x_j$.
Furthermore, we assume symmetric weights, i.e. $W^b_{ij} = W_{ij}$.

Interestingly, in some of our experimental results corresponding
to denser graphs, this approach performs much
better, as will be illustrated in Section VII. One possible
reason behind this improvement might be the fact that using
the $\ell_1$-norm instead of the $\ell_0$-norm in Algorithm 3 will result in better
differentiation between two vectors that have the same number
of non-zero elements, i.e. have equal $\ell_0$-norms, but differ from
each other in the magnitude of the elements, i.e. their $\ell_1$-norms
differ. Therefore, the network may use this additional
information in order to identify the noisy nodes in each update
of the recall algorithm.

V. Performance Analysis

In order to obtain analytical estimates on the recall probability
of error, we assume that the connectivity graph $W$ is
sparse. With respect to this graph, we define the pattern and
constraint degree distributions as follows.

Definition 1. For the bipartite graph $W$, let $\lambda_i$ ($\rho_j$) denote
the fraction of edges that are adjacent to pattern (constraint)
nodes of degree $i$ ($j$). We call $\{\lambda_1, \dots, \lambda_m\}$ and $\{\rho_1, \dots, \rho_n\}$
the pattern and constraint degree distributions from the edge
perspective, respectively. Furthermore, it is convenient to
define the degree distribution polynomials as

$$\lambda(z) = \sum_i \lambda_i z^i \quad \text{and} \quad \rho(z) = \sum_i \rho_i z^i.$$



The degree distributions are determined after the learning 
phase is finished and in this section we assume they are 
given. Furthermore, we consider an ensemble of random 
neural graphs with a given degree distribution and investigate 
the average performance of the recall algorithms over this 
ensemble. Here, the word "ensemble" refers to the fact that 
we assume having a number of random neural graphs with the 
given degree distributions and do the analysis for the average 
scenario. 

To simplify analysis, we assume that the noise entries are 
±1. However, the proposed recall algorithms can work with 
any integer-valued noise and our experimental results suggest 
that this assumption is not necessary in practice. 

Finally, we assume that the errors do not cancel each other 
out in the constraint neurons (as long as the number of 
errors is fairly bounded). This is in fact a realistic assumption 
because the neural graph is weighted, with weights belonging 
to the real field, and the noise values are integers. Thus, the 
probability that the weighted sum of some integers be equal 
to zero is negligible. 

We do the analysis only for the Majority-Voting algorithm
since if we choose the Majority-Voting update threshold
$\varphi = 1$, roughly speaking, we will have the winner-take-all
algorithm.*

As mentioned earlier, in this paper we will perform the
analysis for general sparse bipartite graphs. However, restricting
ourselves to a particular type of sparse graphs known as
"expanders" allows us to prove stronger results on the recall
error probabilities. More details can be found in Appendix C
and in [10]. However, since it is very difficult, if not impossible
in certain cases, to make a graph an expander during an iterative
learning method, we focus on the more general case of sparse
neural graphs.

*It must be mentioned that choosing $\varphi = 1$ does not yield the WTA
algorithm exactly because in the original WTA, only one node is updated
in each round. However, in this version with $\varphi = 1$, all nodes that receive
feedback from all their neighbors are updated. Nevertheless, the performance
of both algorithms is rather similar.

To start the analysis, let $\mathcal{E}_t$ denote the set of erroneous
pattern nodes at iteration $t$, and $\mathcal{N}(\mathcal{E}_t)$ be the set of constraint
nodes that are connected to the nodes in $\mathcal{E}_t$, i.e. these are
the constraint nodes that have at least one neighbor in $\mathcal{E}_t$.
In addition, let $\mathcal{N}^c(\mathcal{E}_t)$ denote the (complementary) set of
constraint neurons that do not have any connection to any
node in $\mathcal{E}_t$. Denote also the average neighborhood size of $\mathcal{E}_t$
by $S_t = \mathbb{E}(|\mathcal{N}(\mathcal{E}_t)|)$. Finally, let $\mathcal{C}_t$ be the set of correct pattern
nodes.

Based on the error correcting algorithm and the above
notations, in a given iteration two types of error events are
possible:

1) Type-1 error event: A node $x \in \mathcal{C}_t$ decides to update its
value. The probability of this phenomenon is denoted by
$P_{e_1}(t)$.

2) Type-2 error event: A node $x \in \mathcal{E}_t$ updates its value in
the wrong direction. Let $P_{e_2}(t)$ denote the probability
of error for this type.

We start the analysis by finding explicit expressions and
upper bounds on the average of $P_{e_1}(t)$ and $P_{e_2}(t)$ over all
nodes as a function of $S_t$. We then find an exact relationship
for $S_t$ as a function of $|\mathcal{E}_t|$, which will provide us with the
required expressions on the average bit error probability as a
function of the number of noisy input symbols, $|\mathcal{E}_0|$. Having
found the average bit error probability, we can easily bound
the block error probability for the recall algorithm.

A. Error probability - type 1 

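To begin, let $P_1^x(t)$ be the probability that a node $x \in \mathcal{C}_t$
with degree $d_x$ updates its state. We have:

$$P_1^x(t) = \Pr\left\{ \frac{|\mathcal{N}(\mathcal{E}_t) \cap \mathcal{N}(x)|}{d_x} \ge \varphi \right\} \tag{12}$$

where $\mathcal{N}(x)$ is the neighborhood of $x$. Assuming random
construction of the graph and relatively large graph sizes, one
can approximate $P_1^x(t)$ by

$$P_1^x(t) \approx \sum_{i=\lceil \varphi d_x \rceil}^{d_x} \binom{d_x}{i} \left(\frac{S_t}{m}\right)^i \left(1 - \frac{S_t}{m}\right)^{d_x - i} \tag{13}$$

In the above equation, $S_t/m$ represents the probability of
having one of the $d_x$ edges connected to the $S_t$ constraint
neurons that are neighbors of the erroneous pattern neurons.
As a result of the above equations, we have:

$$P_{e_1}(t) = \mathbb{E}_{d_x}\left(P_1^x(t)\right) \tag{14}$$

where $\mathbb{E}_{d_x}$ denotes the expectation over the degree distribution
$\{\lambda_1, \dots, \lambda_m\}$.
Note that if $\varphi = 1$, the above equation simplifies to

$$P_{e_1}(t) = \lambda\left(\frac{S_t}{m}\right)$$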
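The type-1 probability is a binomial tail averaged over the pattern degree distribution, which is easy to evaluate numerically. The sketch below assumes the degree distribution is supplied as a dictionary mapping degree $i$ to the edge-perspective fraction $\lambda_i$, and that the ratio $S_t/m$ is given. The function names are hypothetical.

```python
from math import ceil, comb


def p1_given_degree(d_x, phi, s_over_m):
    """Binomial tail: at least ceil(phi * d_x) of the d_x edges land in the
    neighborhood of the erroneous nodes (each with probability S_t / m)."""
    lo = ceil(phi * d_x)   # note: phi * d_x is a float product
    return sum(
        comb(d_x, i) * s_over_m ** i * (1 - s_over_m) ** (d_x - i)
        for i in range(lo, d_x + 1)
    )


def p_e1(lam, phi, s_over_m):
    """Average of the tail over the degree distribution {degree: lambda_i}."""
    return sum(frac * p1_given_degree(d, phi, s_over_m) for d, frac in lam.items())


def lam_poly(lam, z):
    """Degree distribution polynomial, taken here as sum_i lambda_i z^i."""
    return sum(frac * z ** d for d, frac in lam.items())


if __name__ == "__main__":
    lam = {2: 0.5, 4: 0.5}
    print(p_e1(lam, phi=1.0, s_over_m=0.5), lam_poly(lam, 0.5))
```

For $\varphi = 1$ only the term $i = d_x$ survives in each tail, so the two printed values should coincide, matching the simplification stated above.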



B. Error probability - type 2 

A node $x \in \mathcal{E}_t$ makes a wrong decision if the net
input sum it receives has a different sign than the sign of
noise it experiences. Instead of finding an exact relation, we
bound this probability by the probability that the neuron $x$
shares at least half of its neighbors with other neurons, i.e.
$P_{e_2}(t) \le \Pr\left\{ \frac{|\mathcal{N}(\mathcal{E}^*_t) \cap \mathcal{N}(x)|}{d_x} \ge 1/2 \right\}$, where $\mathcal{E}^*_t = \mathcal{E}_t \setminus x$.
Letting $P_2^x(t) = \Pr\left\{ \frac{|\mathcal{N}(\mathcal{E}^*_t) \cap \mathcal{N}(x)|}{d_x} \ge 1/2 \,\middle|\, \deg(x) = d_x \right\}$, we
will have:

$$P_2^x(t) = \sum_{i=\lceil d_x/2 \rceil}^{d_x} \binom{d_x}{i} \left(\frac{S^*_t}{m}\right)^i \left(1 - \frac{S^*_t}{m}\right)^{d_x - i} \tag{15}$$

where $S^*_t = \mathbb{E}(|\mathcal{N}(\mathcal{E}^*_t)|)$.
Therefore, we will have:

$$P_{e_2}(t) \le \mathbb{E}_{d_x}\left(P_2^x(t)\right) \tag{16}$$

Combining equations (14) and (15), the bit error probability
at iteration $t$ would be

$$\begin{aligned}
P_b(t+1) &= \Pr\{x \in \mathcal{C}_t\} P_{e_1}(t) + \Pr\{x \in \mathcal{E}_t\} P_{e_2}(t) \\
&= \frac{n - |\mathcal{E}_t|}{n} P_{e_1}(t) + \frac{|\mathcal{E}_t|}{n} P_{e_2}(t)
\end{aligned} \tag{17}$$

And finally, the average block error rate is given by the
probability that at least one pattern node $x$ is in error. Therefore:

$$P_e(t) = 1 - \left(1 - P_b(t)\right)^n \tag{18}$$

Equation (18) gives the probability of making a mistake in
iteration $t$. Therefore, we can bound the overall probability of
error, $P_E$, by setting $P_E = \lim_{t \to \infty} P_e(t)$. To this end, we
have to recursively update $P_b(t)$ in equation (17) using
$|\mathcal{E}_{t+1}| \approx n P_b(t+1)$. However, since we have assumed that
the noise values are $\pm 1$, we can provide an upper bound on
the total probability of error by considering

$$P_E \le P_e(1) \tag{19}$$

In other words, we assume that the recall algorithms either
correct the input error in the first iteration or an error is
declared. Obviously, this bound is not tight, as in practice
one might be able to correct errors in later iterations. In
fact, simulation results confirm this expectation. However, this
approach provides a nice analytical upper bound since it only
depends on the initial number of noisy nodes. As the initial
number of noisy nodes grows, the above bound becomes tight.
Thus, in summary we have:

$$P_E \le 1 - \left(1 - \frac{n - |\mathcal{E}_0|}{n} \bar{P}_1^x - \frac{|\mathcal{E}_0|}{n} \bar{P}_2^x\right)^n \tag{20}$$

where $\bar{P}_i^x = \mathbb{E}_{d_x}\{P_i^x\}$ and $|\mathcal{E}_0|$ is the number of noisy nodes
in the input pattern initially.
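The first-iteration bound combines the two binomial tails with the bit-to-block error conversion. The following sketch evaluates it under stated assumptions: the neighborhood ratios $S_0/m$ and $S^*_0/m$ are passed in as inputs rather than derived, the degree distribution is a dictionary from degree to edge fraction, and the clamp of the bit error probability to 1 is a numerical safeguard added here. Function and parameter names are hypothetical.

```python
from math import ceil, comb


def binom_tail(d, lo, p):
    """P(Binomial(d, p) >= lo)."""
    return sum(comb(d, i) * p ** i * (1 - p) ** (d - i) for i in range(lo, d + 1))


def recall_error_bound(n, e0, lam, phi, s_over_m, s_star_over_m):
    """First-iteration upper bound on the block error probability.

    n: number of pattern neurons, e0: initial number of noisy neurons,
    lam: {degree: edge fraction}, phi: Majority-Voting threshold.
    """
    # Averaged type-1 and type-2 terms over the degree distribution.
    p1 = sum(f * binom_tail(d, ceil(phi * d), s_over_m) for d, f in lam.items())
    p2 = sum(f * binom_tail(d, ceil(d / 2), s_star_over_m) for d, f in lam.items())
    # Bit error probability after one iteration.
    p_b = (n - e0) / n * p1 + e0 / n * p2
    p_b = min(p_b, 1.0)  # guard against floating point overshoot
    # Block error: at least one of the n pattern neurons is wrong.
    return 1.0 - (1.0 - p_b) ** n


if __name__ == "__main__":
    lam = {3: 0.5, 4: 0.5}
    for e0 in (1, 5, 10):
        print(e0, recall_error_bound(100, e0, lam, 1.0, 0.02 * e0, 0.02 * (e0 - 1)))
```

The ratios used in the loop are illustrative placeholders only; in practice they would come from the relationship between the neighborhood size and the number of erroneous nodes.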

Remark 5. One might hope to further simplify the above
inequalities by finding closed form approximations of equations
(13) and (15). However, as one expects, this approach leads
to very loose and trivial bounds in many cases. Therefore, in
our experiments shown in Section VII we compare simulation
results to the theoretical bound derived using equations (13)
and (15).

Now, what remains to do is to find an expression for $S_t$ and
$S^*_t$ as a function of $|\mathcal{E}_t|$. The following lemma will provide us
with the required relationship.

Lemma 3. The average neighborhood size $S_t$ in iteration $t$
is given by:

$$S_t = m\left(1 - \left(1 - \frac{\bar{d}}{m}\right)^{|\mathcal{E}_t|}\right) \tag{21}$$

where $\bar{d}$ is the average degree for pattern nodes.

Proof: The proof is given in Appendix A. ■

VI. Pattern Retrieval Capacity

It is interesting to see that, except for its obvious influence
on the learning time, the number of patterns $C$ does not have
any effect in the learning or recall algorithm. As long as the
patterns come from a subspace, the learning algorithm will
yield a matrix which is orthogonal to all of the patterns in the
training set. And in the recall phase, all we deal with is $Wz$,
with $z$ being the noise which is independent of the patterns.

Therefore, in order to show that the pattern retrieval capacity
is exponential with $n$, all we need to show is that there exists
a "valid" training set $\mathcal{X}$ with $C$ patterns of length $n$ for which
$C \propto a^{rn}$, for some $a > 1$ and $0 < r$. By valid we mean
that the patterns should come from a subspace with dimension
$k < n$ and the entries in the patterns should be non-negative
integers. The next theorem proves the desired result.

Theorem 4. Let $\mathcal{X}$ be a $C \times n$ matrix, formed by $C$ vectors
of length $n$ with non-negative integer entries between $0$ and
$Q - 1$. Furthermore, let $k = rn$ for some $0 < r < 1$. Then,
there exists a set of such vectors for which $C = a^{rn}$, with
$a > 1$, and $\operatorname{rank}(\mathcal{X}) = k < n$.

Proof: The proof is based on construction: we construct
a data set $\mathcal{X}$ with the required properties. To start, consider a
matrix $G \in \mathbb{R}^{k \times n}$ with rank $k$ and $k = rn$, with $0 < r < 1$.
Let the entries of $G$ be non-negative integers, between $0$ and
$\gamma - 1$, with $\gamma \ge 2$.

We start constructing the patterns in the data set as follows:
consider a set of random vectors $u^\mu \in \mathbb{R}^k$, $\mu = 1, \dots, C$, with
integer-valued entries between $0$ and $\upsilon - 1$, where $\upsilon \ge 2$. We
set the pattern $x^\mu \in \mathcal{X}$ to be $x^\mu = u^\mu \cdot G$, if all the entries
of $x^\mu$ are between $0$ and $Q - 1$. Obviously, since both $u^\mu$ and
$G$ have only non-negative entries, all entries in $x^\mu$ are non-negative.
Therefore, it is the $Q - 1$ upper bound that we have
to worry about.

The $j$-th entry in $x^\mu$ is equal to $x^\mu_j = u^\mu \cdot G_j$, where $G_j$ is
the $j$-th column of $G$. Suppose $G_j$ has $d_j$ non-zero elements.
Then, we have:

$$x^\mu_j = u^\mu \cdot G_j \le d_j (\gamma - 1)(\upsilon - 1)$$

Therefore, denoting $d^* = \max_j d_j$, we could choose $\gamma$, $\upsilon$
and $d^*$ such that

$$Q - 1 \ge d^* (\gamma - 1)(\upsilon - 1) \tag{22}$$

to ensure all entries of $x^\mu$ are less than $Q$.

As a result, since there are $\upsilon^k$ vectors $u$ with integer entries
between $0$ and $\upsilon - 1$, we will have $\upsilon^k = \upsilon^{rn}$ patterns forming
$\mathcal{X}$. Which means $C = \upsilon^{rn}$, which would be an exponential
number in $n$ if $\upsilon \ge 2$. ■

As an example, if $G$ can be selected to be a sparse $200 \times 400$
matrix with 0/1 entries (i.e. $\gamma = 2$) and $d^* = 10$, and $u$ is
also chosen to be a vector with 0/1 elements (i.e. $\upsilon = 2$), then
it is sufficient to choose $Q \ge 11$ to have a pattern retrieval
capacity of $C = 2^{rn}$.
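The construction in the proof can be reproduced at toy scale. The sketch below is illustrative only: the $4 \times 8$ generator matrix is a hypothetical example (an identity block followed by sparse columns, so that distinct message vectors give distinct patterns), and the function names are not taken from the paper.

```python
from itertools import product


def build_patterns(G, upsilon):
    """Enumerate x = u . G for all message vectors u with entries in
    {0, ..., upsilon - 1}; returns the set of distinct patterns."""
    k, n = len(G), len(G[0])
    patterns = set()
    for u in product(range(upsilon), repeat=k):
        x = tuple(sum(u[i] * G[i][j] for i in range(k)) for j in range(n))
        patterns.add(x)
    return patterns


def worst_case_entry(G, gamma_levels, upsilon):
    """Worst-case pattern entry d* (gamma - 1)(upsilon - 1), where d* is the
    largest number of non-zero elements in any column of G."""
    n = len(G[0])
    d_star = max(sum(1 for row in G if row[j] != 0) for j in range(n))
    return d_star * (gamma_levels - 1) * (upsilon - 1)


# Hypothetical 4 x 8 generator with 0/1 entries (gamma = 2) and d* = 2.
G = [
    [1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 1],
]

if __name__ == "__main__":
    patterns = build_patterns(G, upsilon=2)
    bound = worst_case_entry(G, gamma_levels=2, upsilon=2)
    print(len(patterns), bound, max(max(x) for x in patterns))
```

Here $k = 4$ and $\upsilon = 2$ give $2^4$ distinct patterns, and every entry stays within the worst-case value, so any $Q$ exceeding that value by at least one makes the whole set valid.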



Remark 6. Note that inequality (22) was obtained for the
worst-case scenario and in fact is very loose. Therefore, even if
it does not hold, we will still be able to memorize a very large
number of patterns since a big portion of the generated vectors
$x^\mu$ will have entries less than $Q$. These vectors correspond
to the message vectors $u^\mu$ that are "sparse" as well, i.e. do
not have all entries greater than zero. The number of such
vectors is a polynomial in $n$, the degree of which depends on
the number of non-zero entries in $u^\mu$.

VII. Simulation Results 

A. Simulation Scenario 

We have simulated the proposed learning and recall algorithms
for three different network sizes $n = 200, 400, 800$,
with $k = n/2$ for all cases. For each case, we considered a
few different setups with different values for $\alpha$, $\eta$, and $\theta$ in the
learning Algorithm 1 and different $\varphi$ for the Majority-Voting
recall Algorithm 3. For brevity, we do not report all the results
for various combinations but present only a selection of them
to give insight on the performance of the proposed algorithms.

In all cases, we generated 50 random training sets using the
approach explained in the proof of Theorem 4, i.e. we generated
a generator matrix $G$ at random with 0/1 entries and $d^* = 10$.
We also used 0/1 generating message words $u$ and put $Q = 11$
to ensure the validity of the generated training set.

However, since in this setup we will have $2^k$ patterns to
memorize, doing a simulation over all of them would take 
a lot of time. Therefore, we have selected a random sample 
sub-set X each time with size C = 10^ for each of the 50 
generated sets and used these subsets as the training set. 

For each setup, we performed the learning algorithm and
then investigated the average sparsity of the learned constraints
over the ensemble of 50 instances. As explained earlier, all the
constraints for each network were learned in parallel, i.e. to
obtain $m = n - k$ constraints, we executed Algorithm 1 from
random initial points $m$ times.

As for the recall algorithms, the error correcting performance
was assessed for each setup, averaged over the ensemble
of 50 instances. The empirical results are compared to
the theoretical bounds derived in Section V as well.

B. Learning Phase Results 

In the learning algorithm, we pick a pattern from the training
set each time and adjust the weights according to Algorithm
1. Once we have gone over all the patterns, we repeat this
operation several times to make sure that the update for one pattern
does not adversely affect the other learned patterns. Let $t$ be
the iteration number of the learning algorithm, i.e. the number
of times we have gone over the training set so far. Then
we set $\alpha_t \propto \alpha_0/t$ to ensure that the conditions of Theorem 1
are satisfied. Interestingly, all of the constraints converged in at
most two learning iterations for all different setups. Therefore,
the learning is very fast in this case.

Figure 3 illustrates the percentage of pattern nodes with the
specified sparsity measure defined as $\varrho = \kappa/n$, where $\kappa$ is the
number of non-zero elements. From the figure we notice two
trends. The first is the effect of the sparsity threshold: as it
is increased, the network becomes sparser. The second is
the effect of the network size: as it grows, the connections
become sparser.

Fig. 3. The percentage of variable nodes with the specified sparsity measure
and different values of network sizes and sparsity thresholds. The sparsity
measure is defined as $\varrho = \kappa/n$, where $\kappa$ is the number of non-zero elements.



C. Recall Phase Results 

For the recall phase, in each trial we pick a pattern randomly
from the training set, corrupt a given number of its symbols
with $\pm 1$ noise and use the suggested algorithm to correct
the errors. A pattern error is declared if the output does not
match the correct pattern. We compare the performance of the
two recall algorithms: Winner-Take-All (WTA) and Majority-Voting
(MV). Table I shows the simulation parameters in
the recall phase for all scenarios (unless specified otherwise).



TABLE I
Simulation parameters

Parameter   $\varphi$   $t_{\max}$      $\varepsilon$   $\eta$
Value       1           $20\|z\|_0$     0.001           1



Figure 4 illustrates the effect of the sparsity threshold $\theta$
on the performance of the error correcting algorithm in the
recall phase. Here, we have $n = 400$ and $k = 200$. Two
different sparsity thresholds are compared together, namely
$\theta_t \propto 0.031/t$ and $\theta_t \propto 0.021/t$. Clearly, as the network becomes
sparser, i.e. $\theta$ increases, the performance of both recall algorithms
improves.

Fig. 4. Pattern error rate against the initial number of erroneous nodes for
two different values of $\theta_0$. Here, the network size is $n = 400$ and $k = 200$.
The blue curves correspond to the sparser network (larger $\theta_0$) and clearly
show a better performance.

In Figure 5 we have investigated the effect of network
size on the performance of recall algorithms by comparing
the pattern error rates for two different network sizes, namely
$n = 800$ and $n = 400$ with $k = n/2$ in both cases. As is obvious
from the figure, the performance improves to a great extent
when we have a larger network. This is partially because of
the fact that in larger networks, the connections are relatively
sparser as well.

Fig. 5. Pattern error rate against the initial number of erroneous nodes for
two different network sizes $n = 800$ and $n = 400$. In both cases $k = n/2$.

Figure 6 compares the results obtained in simulation with
the upper bound derived in Section V. Note that as expected,
the bound is quite loose since in deriving inequality (18) we
only considered the first iteration of the algorithm.

We have also investigated the tightness of the bound given in
equation (19) with simulation results. To this end, we compare
$P_e(1)$ and $\lim_{t \to \infty} P_e(t)$ in our simulations for the case of
$\pm 1$ noise. Figure 7 illustrates the result and it is evident
that allowing the recall algorithm to iterate improves the final
probability of error to a great extent.

Finally, we investigate the performance of the modified
more practical version of the Majority-Voting algorithm, which
was explained in Section IV-B. Figure 8 compares the performance
of the WTA and original MV algorithms with the
modified version of the MV algorithm for a network with size
$n = 200$, $k = 100$ and learning parameters $\alpha_t \propto 0.45/t$,
$\eta = 0.45$ and $\theta_t \propto 0.015/t$. The neural graph of this
particular example is rather dense, because of the small $n$ and
sparsity threshold $\theta$. Therefore, here the modified version
of the Majority-Voting algorithm performs better because of
the extra information provided by the $\ell_1$-norm (than the $\ell_0$-norm
in the original version of the Majority-Voting algorithm).
However, note that we did not observe this trend for the other
simulation scenarios where the neural graph was sparser.

VIII. Conclusions and Future Works 

In this paper, we proposed a neural associative memory
which is capable of exploiting inherent redundancy in input
patterns to enjoy an exponentially large pattern retrieval capacity.
Furthermore, the proposed method uses simple iterative
algorithms for both learning and recall phases, which
makes gradual learning possible and maintains rather good
recall performance. The convergence of the proposed learning
algorithm was proved using techniques from stochastic approximation.
We also analytically investigated the performance
of the recall algorithm by deriving an upper bound on the
probability of recall error as a function of input noise. Our
simulation results confirm the consistency of the theoretical
results with those obtained in practice, for different network
sizes and learning/recall parameters.

Improving the error correction capabilities of the proposed
network is definitely a subject of our future research. We
have already started investigating this issue and proposed a
different network structure which reduces the error
probability by a factor of 10 in many cases [22]. We are
working on different structures to obtain even more robust
recall algorithms.

Extending this method to capture other sorts of redundancy,
i.e. other than belonging to a subspace, is another topic
which we would like to explore in the future.

Finally, considering some practical modifications to the
learning and recall algorithms is of great interest. One good
example is a simultaneous learning and recall capability, i.e. to have
a network which learns a subset of the patterns in the subspace
and moves immediately to the recall phase. During the
recall phase, if the network is given a noisy version of a
previously memorized pattern, it eliminates the noise using
the algorithms described in this paper. However, if it is given a new
pattern, i.e. one that it has not learned yet, the network
adjusts the weights in order to learn this pattern as well.
Such a model is of practical interest and closer to real-world
neuronal networks. Therefore, it would be interesting to design
a network with this capability while maintaining good error
correcting capabilities and large pattern retrieval capacities.

Fig. 6. Pattern error rate against the initial number of erroneous nodes and
comparison with theoretical upper bounds for n = 800, k = 400, $\alpha_0 = 0.95$
and $\theta_0 = 0.029$.

Fig. 7. Pattern error rate in the first and last iterations against the initial
number of erroneous nodes for n = 800, k = 400, $\alpha_0 = 0.95$, $\theta_0 = 0.029$
and = 0.99.

Fig. 8. Pattern error rate against the initial number of erroneous nodes for
two different values of $\theta_0$. Here, the network size is n = 400 and k = 200.
The blue curves correspond to the sparser network (larger $\theta_0$) and clearly
show a better performance.

Appendix A
Average Neighborhood Size

In this appendix, we find an expression for the average
neighborhood size for erroneous nodes, $S_t = \mathbb{E}(|\mathcal{N}(\mathcal{E}_t)|)$.
Towards this end, we assume the following procedure for
constructing a right-irregular bipartite graph:

• In each iteration, we pick a variable node $x$ with a degree
randomly determined according to the given degree
distribution.

• Based on the given degree $d_x$, we pick $d_x$ constraint
nodes uniformly at random with replacement and connect
$x$ to these constraint nodes.

• We repeat this process $n$ times, until all variable nodes
are connected.

Note that the assumption that we do the process with
replacement is made to simplify the analysis. This assumption
becomes more exact as $n$ grows.

Having the above procedure in mind, we will find an
expression for the average number of constraint nodes in each
construction round. More specifically, we will find the average
number of constraint nodes connected to $i$ pattern nodes at
round $i$ of construction. This relationship will in turn yield the
average neighborhood size of $|\mathcal{E}_t|$ erroneous nodes in iteration
$t$ of the error correction algorithm described in Section IV.



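With some abuse of notation, let $S_e$ denote the number
of constraint nodes connected to pattern nodes in round $e$
of the construction procedure mentioned above. We write $S_e$
recursively in terms of $e$ as follows:

$$
\begin{aligned}
S_{e+1} &= \mathbb{E}_{d_x}\left\{ \sum_{j=0}^{d_x} \binom{d_x}{j} \left(\frac{S_e}{m}\right)^{d_x - j} \left(1 - \frac{S_e}{m}\right)^{j} (S_e + j) \right\} \\
        &= \mathbb{E}_{d_x}\left\{ S_e + d_x (1 - S_e/m) \right\} \\
        &= S_e + \bar{d}\,(1 - S_e/m),
\end{aligned}
\tag{23}
$$

where $\bar{d} = \mathbb{E}_{d_x}\{d_x\}$ is the average degree of the pattern
nodes. In words, the first line calculates the average growth
of the neighborhood when a new variable node is added to
the graph. The subsequent equalities follow directly from
standard identities on binomial sums. Noting that $S_1 = \bar{d}$, one
obtains:

$$S_t = m\left(1 - \left(1 - \frac{\bar{d}}{m}\right)^{t}\right). \tag{24}$$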
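As a rough sanity check on the closed form $S_t = m(1 - (1 - \bar{d}/m)^t)$, the following Python sketch builds neighborhoods with the with-replacement procedure described above and compares the empirical average with the formula. The function names, the toy degree list and the trial count are illustrative assumptions, not the setup used for the figures. Since a single round can hit the same constraint node more than once, a small gap relative to the closed form is to be expected when m is small.

```python
import random


def predicted_neighborhood_size(t, m, d_bar):
    """Closed form S_t = m * (1 - (1 - d_bar/m)**t) for t erroneous nodes."""
    return m * (1.0 - (1.0 - d_bar / m) ** t)


def simulated_neighborhood_size(t, m, degrees, trials=2000, seed=0):
    """Monte Carlo estimate of the average neighborhood size of t nodes.

    `degrees` is a list from which each node's degree is drawn uniformly
    (an assumed stand-in for the degree distribution of a learned graph).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        neighborhood = set()
        for _ in range(t):
            d_x = rng.choice(degrees)                # degree of the new node
            for _ in range(d_x):
                neighborhood.add(rng.randrange(m))   # pick with replacement
        total += len(neighborhood)
    return total / trials


if __name__ == "__main__":
    m, degrees = 50, [3, 5, 7]
    d_bar = sum(degrees) / len(degrees)
    for t in (1, 5, 10, 20):
        est = simulated_neighborhood_size(t, m, degrees)
        pred = predicted_neighborhood_size(t, m, d_bar)
        print(t, round(est, 2), round(pred, 2))
```

The simulated and predicted values should track each other closely, with the agreement improving as m grows.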

In order to verify the correctness of the above analysis, we
have performed some simulations for different network sizes
and degree distributions obtained from the graphs returned
by the learning algorithm. We generated 100 random graphs
and calculated the average neighborhood size in each iteration
over these graphs. Two different network sizes were
considered, n = 100 and n = 200, with m = n/2 in all cases,
where n and m are the number of pattern and constraint
nodes, respectively. The result for n = 100, m = 50 is
shown in Figure 9, where the average neighborhood size in
each iteration is illustrated and compared with the theoretical
estimations given by equation (24). Figure 10 shows similar
results for n = 200, m = 100. In the figures, the dashed line
shows the average neighborhood size over these graphs. The
solid line corresponds to theoretical estimations. The
theoretical values closely approximate the simulation results.

Fig. 9. The theoretical estimation and simulation results for the neighborhood
size of irregular graphs with a given degree-distribution for n = 100, m = 50
and over 2000 random graphs.

Fig. 10. The theoretical estimation and simulation results for the neighborhood
size of irregular graphs with a given degree-distribution for n = 200,
m = 100 and over 2000 random graphs.



Appendix B 
Expander Graphs 

This section contains the definitions and the necessary 
background on expander graphs. 

Definition 2. A regular $(d_p, d_c, n, m)$ bipartite graph $W$ is a
bipartite graph between $n$ pattern nodes of degree $d_p$ and $m$
constraint nodes of degree $d_c$.

Definition 3. An $(\alpha n, \beta d_p)$-expander is a $(d_p, d_c, n, m)$ bipartite
graph such that for any subset $\mathcal{P}$ of pattern nodes with
$|\mathcal{P}| < \alpha n$ we have $|\mathcal{N}(\mathcal{P})| > \beta d_p |\mathcal{P}|$, where $\mathcal{N}(\mathcal{P})$ is the set
of neighbors of $\mathcal{P}$ among the constraint nodes.
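For very small graphs, the expansion condition in Definition 3 can be checked by brute force. The Python sketch below is only an illustration under stated assumptions: the graph is given as a list `neighbors`, where `neighbors[i]` is the set of constraint nodes adjacent to pattern node i, and the function name `is_expander` is hypothetical. The enumeration is exponential in the subset size, so it is not meant for graphs of realistic size.

```python
from itertools import combinations


def is_expander(neighbors, alpha, beta, d_p):
    """Return True if every non-empty subset P of pattern nodes with
    |P| < alpha * n satisfies |N(P)| > beta * d_p * |P|."""
    n = len(neighbors)
    for size in range(1, n + 1):
        if size >= alpha * n:
            break  # only subsets strictly smaller than alpha * n are constrained
        for subset in combinations(range(n), size):
            # Joint neighborhood of the chosen pattern nodes.
            joint = set().union(*(neighbors[i] for i in subset))
            if len(joint) <= beta * d_p * size:
                return False
    return True


if __name__ == "__main__":
    # Toy graph: 4 pattern nodes of degree 2 attached to 6 constraint nodes.
    toy = [{0, 1}, {2, 3}, {4, 5}, {0, 2}]
    print(is_expander(toy, alpha=0.75, beta=0.7, d_p=2))
    print(is_expander(toy, alpha=0.75, beta=0.8, d_p=2))
```

In the toy graph, nodes 0 and 3 share one neighbor, so their joint neighborhood has three members; this is enough for $\beta = 0.7$ but not for $\beta = 0.8$.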

The following result from [31] shows the existence of
families of expander graphs with parameter values that are
relevant to us.

Theorem 5. [31] Let $W$ be a randomly chosen
$(d_p, d_c)$-regular bipartite graph between $n$ $d_p$-regular
vertices and $m = (d_p/d_c)\,n$ $d_c$-regular vertices. Then for all
$0 < \alpha < 1$, with high probability, all sets of $\alpha n$ $d_p$-regular
vertices in $W$ have at least

$$n\left(\frac{d_p}{d_c}\left(1 - (1-\alpha)^{d_c}\right) - \sqrt{\frac{2 d_p \alpha\, h(\alpha)}{\log_2 e}}\right)$$

neighbors, where $h(\cdot)$ is the binary entropy function.

The following result from p3] | shows the existence of 
families of expander graphs with parameter values that are 
relevant to us. 

Theorem 6. Let $d_c, d_p, m, n$ be integers, and let $\beta < 1 - 1/d_p$.
There exists a small $\alpha > 0$ such that if $W$ is a $(d_p, d_c, n, m)$
bipartite graph chosen uniformly at random from the ensemble
of such bipartite graphs, then $W$ is an $(\alpha n, \beta d_p)$-expander
with probability $1 - o(1)$, where $o(1)$ is a term going to zero
as $n$ goes to infinity.

Appendix C 

Analysis of the Recall Algorithms for Expander 
Graphs 

A. Analysis of the Winner-Take-All Algorithm

We prove the error correction capability of the winner-take-all
algorithm in two steps: first we show that in each iteration,
only pattern neurons that are corrupted by noise will be chosen
by the winner-take-all strategy to update their state. Then,
we prove that the update is in the right direction, i.e. toward
removing noise from the neurons.

Lemma 7. If the constraint matrix $W$ is an $(\alpha n, \beta d_p)$
expander, with $\beta > 1/2$, and the original number of erroneous
neurons is less than or equal to 2, then in each iteration
of the winner-take-all algorithm only the corrupted pattern
nodes update their value and the other nodes remain intact.
For l3 — the algorithm will always pick the correct node 
if we have two or fewer erroneous nodes. 

Proof: If we have only one node $x_i$ in error, it is
obvious that the corresponding node will always be the winner
of the winner-take-all algorithm unless there exists another
node that has the same set of neighbors as $x_i$. However,
this is impossible because, by the expansion property, the
joint neighborhood of these two nodes must have at least $2\beta d_p$
members, which for $\beta > 1/2$ is strictly greater than $d_p$. As a
result, no two nodes can have the same neighborhood and the
winner will always be the correct node.

In the case where there are two erroneous nodes, say
$x_i$ and $x_j$, let $\mathcal{E}$ be the set $\{x_i, x_j\}$ and $\mathcal{N}(\mathcal{E})$ be the
corresponding neighborhood on the constraint nodes side.
Furthermore, assume $x_i$ and $x_j$ share $d_{p'}$ of their neighbors
so that $|\mathcal{N}(\mathcal{E})| = 2d_p - d_{p'}$. Now because of the expansion
properties:

$$|\mathcal{N}(\mathcal{E})| = 2d_p - d_{p'} > 2\beta d_p \;\Rightarrow\; d_{p'} < 2(1 - \beta) d_p.$$

Now we have to show that there are no nodes other than $x_i$
and $x_j$ that can be the winner of the winner-take-all algorithm.
To this end, note that only those nodes that are connected
to $\mathcal{N}(\mathcal{E})$ will receive some feedback and can hope to be the
winner of the process. So let us consider such a node $x_\ell$ that
is connected to $d_{p''}$ of the nodes in $\mathcal{N}(\mathcal{E})$. Let $\mathcal{E}'$ be $\mathcal{E} \cup \{x_\ell\}$
and $\mathcal{N}(\mathcal{E}')$ be the corresponding neighborhood. Because of the
expansion properties we have $|\mathcal{N}(\mathcal{E}')| = d_p - d_{p''} + |\mathcal{N}(\mathcal{E})| > 3\beta d_p$. Thus:

$$d_{p''} < d_p + |\mathcal{N}(\mathcal{E})| - 3\beta d_p = 3d_p(1 - \beta) - d_{p'}.$$

Now, note that the nodes $x_i$ and $x_j$ will receive some feedback
from $2d_p - d_{p'}$ edges because we assume there is no noise
cancellation due to the fact that neural weights are real-valued
and noise entries are integers. Since $2d_p - d_{p'} > 3d_p(1 - \beta) - d_{p'}$
for $\beta > 1/2$, we conclude that $d_p - d_{p'} > d_{p''}$,
which proves that no node outside $\mathcal{E}$ can be picked during the
winner-take-all algorithm as long as $|\mathcal{E}| \le 2$ for $\beta > 1/2$. ■
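To illustrate the selection rule analysed above, the following Python sketch implements one simplified winner-take-all iteration. It is an illustration built on several assumptions rather than the algorithm as specified earlier in the paper: the feedback of a pattern node is taken to be the fraction of its neighboring constraints that report a non-zero value, the update direction is a signed vote over those neighbors, ties are broken by index, and the names `sign`, `wta_step` and the toy matrix `W` are hypothetical.

```python
def sign(v, tol=1e-9):
    """Return -1, 0 or +1, treating |v| <= tol as zero."""
    return (v > tol) - (v < -tol)


def wta_step(W, x):
    """One simplified winner-take-all iteration.

    W is a list of m rows (one per constraint neuron) and x a list of n
    integers. Returns a new list; x is returned unchanged (as a copy) if
    every constraint is satisfied.
    """
    m, n = len(W), len(x)
    # Forward pass: each constraint neuron reports the sign of w_i . x.
    y = [sign(sum(W[i][j] * x[j] for j in range(n))) for i in range(m)]
    if not any(y):
        return list(x)

    winner, best_fraction, winner_vote = None, -1.0, 0
    for j in range(n):
        nbrs = [i for i in range(m) if W[i][j] != 0]
        if not nbrs:
            continue
        # Fraction of neighboring constraints that report a violation.
        fraction = sum(1 for i in nbrs if y[i] != 0) / len(nbrs)
        # Signed vote on the direction of the noise at node j.
        vote = sum(sign(W[i][j]) * y[i] for i in nbrs)
        if fraction > best_fraction:
            winner, best_fraction, winner_vote = j, fraction, vote

    updated = list(x)
    updated[winner] -= sign(winner_vote)  # one unit against the estimated noise
    return updated


if __name__ == "__main__":
    # Toy constraints whose null space contains the pattern (2, 2, 2, 2).
    W = [[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1], [1, 0, 0, -1]]
    print(wta_step(W, [3, 2, 2, 2]))  # node 0 carries +1 noise
```

In the toy example only node 0 has all of its neighboring constraints violated, so it wins and is decremented by one unit, recovering the stored pattern. The toy matrix is not claimed to satisfy the expansion conditions of the lemma.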



In the next lemma, we show that the state of erroneous
neurons is updated in the direction of reducing the noise.

Lemma 8. If the constraint matrix $W$ is an $(\alpha n, \beta d_p)$
expander, with $\beta > 3/4$, and the original number of erroneous
neurons is less than or equal to $e_{\min} = 2$, then in each iteration
of the winner-take-all algorithm the winner is updated toward
reducing the noise.

Proof: When there is only one erroneous node, it is
obvious that all its neighbors agree on the direction of update
and the node reduces the amount of noise by one unit.

If there are two nodes $x_i$ and $x_j$ in error, since the number of
their shared neighbors is less than $2(1 - \beta)d_p$ (as we proved in
the last lemma), more than half of their neighbors are
unique if $\beta > 3/4$. These unique neighbors agree on the
direction of update. Therefore, whoever the winner is will be
updated to reduce the amount of noise by one unit. ■

The following theorem sums up the results of the previous
lemmas to show that the winner-take-all algorithm is
guaranteed to perform error correction.

Theorem 7. If the constraint matrix $W$ is an $(\alpha n, \beta d_p)$
expander, with $\beta > 3/4$, then the winner-take-all algorithm
is guaranteed to correct at least $e_{\min} = 2$ positions in error,
irrespective of the magnitudes of the errors.

Proof: The proof is immediate from Lemmas 7 and 8. ■

B. Analysis of the Majority Algorithm

Roughly speaking, one would expect the Majority-Voting
algorithm to be sub-optimal in comparison to the winner-take-all
strategy, since the pattern neurons need to make independent
decisions, and are not allowed to cooperate amongst
themselves. In this subsection, we show that despite this
restriction, the Majority-Voting algorithm is capable of error
correction; the sub-optimality in comparison to the
winner-take-all algorithm can be quantified in terms of a larger
expansion factor $\beta$ being required for the graph.

Theorem 8. If the constraint matrix $W$ is an $(\alpha n, \beta d_p)$
expander with $\beta > 4/5$, then the Majority-Voting algorithm with
$\varphi = 3/5$ is guaranteed to correct at least two positions in error,
irrespective of the magnitudes of the errors.

Proof: As in the proof for the winner-take-all case, we
will show our result in two steps: first, by showing that for a
suitable choice of the Majority-Voting threshold $\varphi$, only
the positions in error are updated in each iteration, and second, that
this update is towards reducing the effect of the noise.

a) Case 1: First consider the case that only one pattern
node $x_i$ is in error. Let $x_j$ be any other pattern node, for some
$j \neq i$. Let $x_i$ and $x_j$ have $d_{p'}$ neighbors in common. As argued
in the proof of Lemma 7, we have that

$$d_{p'} < 2d_p(1 - \beta). \tag{25}$$

Hence for $\beta = 4/5$, $x_i$ receives non-zero feedback from at least
$\frac{3}{5} d_p$ constraint nodes, while $x_j$ receives non-zero feedback
from at most $\frac{2}{5} d_p$ constraint nodes. In this case, it is clear that
setting $\varphi = 3/5$ will guarantee that only the node in error will
be updated, and that the direction of this update is towards
reducing the noise.

b) Case 2: Now suppose that two distinct nodes $x_i$ and
$x_j$ are in error. Let $\mathcal{E} = \{x_i, x_j\}$, and let $x_i$ and $x_j$ share $d_{p'}$
common neighbors. If the noise values corrupting these two pattern
nodes, denoted by $z_i$ and $z_j$, are such that $\mathrm{sign}(z_i) = \mathrm{sign}(z_j)$,
then both $x_i$ and $x_j$ receive $-\mathrm{sign}(z_i)$ along all $d_p$ edges
that they are connected to during the backward iteration. Now
suppose that $\mathrm{sign}(z_i) \neq \mathrm{sign}(z_j)$. Then $x_i$ ($x_j$) receives correct
feedback from at least the $d_p - d_{p'}$ edges in $\mathcal{N}(\{x_i\}) \setminus \mathcal{E}$
(resp. $\mathcal{N}(\{x_j\}) \setminus \mathcal{E}$) during the backward iteration. Therefore, if
$d_{p'} < d_p/2$, the direction of update would also be correct and
the feedback will reduce noise during the update. And from
equation (25) we know that for $\beta = 4/5$, $d_{p'} < 2d_p/5 < d_p/2$.
Therefore, the two noisy nodes will be updated towards the
correct direction.

Let us now examine what happens to a node $x_\ell$ that is
different from the two erroneous nodes $x_i, x_j$. Suppose that
$x_\ell$ is connected to $d_{p''}$ nodes in $\mathcal{N}(\mathcal{E})$. From the proof of
Lemma 7 we know that

$$
\begin{aligned}
d_{p''} &< 3d_p(1 - \beta) - d_{p'} \\
        &< 3d_p(1 - \beta).
\end{aligned}
$$

Hence $x_\ell$ receives at most $3d_p(1 - \beta)$ non-zero messages
during the backward iteration.

For $\beta > 4/5$, we have that $d_p - 2d_p(1 - \beta) > 3d_p(1 - \beta)$.
Hence by setting $\beta = 4/5$ and $\varphi = [d_p - 2d_p(1 - \beta)]/d_p = 3/5$,
it is clear from the above discussion that we have ensured the
following in the case of two erroneous pattern nodes:

• The noisy pattern nodes are updated towards the direction
of reducing noise.

• No pattern node other than the erroneous pattern nodes
is updated.

C. Minimum Distance of Patterns

Next, we present a sufficient condition such that the minimum
Hamming distance¹ between these exponentially many
patterns is not too small. In order to prove such a result,
we will exploit the expansion properties of the bipartite graph
$W$; our sufficient condition will be in terms of a lower bound
on the parameters of the expander graph.

Theorem 9. Let $W$ be a $(d_p, d_c, n, m)$-regular bipartite
graph that is an $(\alpha n, \beta d_p)$ expander. Let $\mathcal{X}$ be the set of
patterns corresponding to the expander weight matrix $W$. If

$$\beta > \frac{1}{2} + \frac{1}{4d_p},$$

then the minimum distance between the patterns is at least
$\lfloor \alpha n \rfloor + 1$.

¹Two (possibly non-binary) $n$-length vectors $x$ and $y$ are said to be at
a Hamming distance $d$ from each other if they are coordinate-wise equal to
each other on all but $d$ coordinates.


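Proof: Let $d$ be less than $\alpha n$, and let $W_i$ denote the $i$-th
column of $W$. If two patterns are at Hamming distance $d$ from
each other, then there exist non-zero integers $c_1, c_2, \ldots, c_d$
such that

$$c_1 W_{i_1} + c_2 W_{i_2} + \cdots + c_d W_{i_d} = 0, \tag{26}$$

where $i_1, \ldots, i_d$ are distinct integers between 1 and $n$. Let
$\mathcal{P}$ denote any set of pattern nodes of the graph represented
by $W$, with $|\mathcal{P}| = d$. As in [32], we divide $\mathcal{N}(\mathcal{P})$ into
two disjoint sets: $\mathcal{N}_{\text{unique}}(\mathcal{P})$ is the set of nodes in $\mathcal{N}(\mathcal{P})$
that are connected to only one edge emanating from $\mathcal{P}$, and
$\mathcal{N}_{\text{shared}}(\mathcal{P})$ comprises the remaining nodes of $\mathcal{N}(\mathcal{P})$ that are
connected to more than one edge emanating from $\mathcal{P}$. If we
show that $|\mathcal{N}_{\text{unique}}(\mathcal{P})| > d/2$ for all $\mathcal{P}$ with $|\mathcal{P}| = d$, then (26)
cannot hold, allowing us to conclude that no two patterns with
distance $d$ exist. Using the arguments in [32, Lemma 1], we
obtain that

$$|\mathcal{N}_{\text{unique}}(\mathcal{P})| > 2d_p|\mathcal{P}|\left(\beta - \frac{1}{2}\right).$$

Hence no two patterns with distance $d$ exist if

$$2d_p d\left(\beta - \frac{1}{2}\right) > \frac{d}{2} \;\Leftrightarrow\; \beta > \frac{1}{2} + \frac{1}{4d_p}.$$

By choosing $\beta > \frac{1}{2} + \frac{1}{4d_p}$ we can hence ensure that the
minimum distance between patterns is at least $\lfloor \alpha n \rfloor + 1$. ■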
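The split of a neighborhood into unique and shared constraint nodes used in the proof can be made concrete with a short Python sketch. The representation of the graph as a list of neighbor sets and the function names `split_neighborhood` and `unique_lower_bound` are assumptions made for illustration; the second function simply evaluates the bound $2d_p|\mathcal{P}|(\beta - 1/2)$ quoted in the proof and does not check that a given graph is actually an expander.

```python
from collections import Counter
from fractions import Fraction


def split_neighborhood(neighbors, subset):
    """Split N(P) into nodes touched by exactly one edge from P (unique)
    and nodes touched by more than one edge from P (shared)."""
    counts = Counter(c for i in subset for c in neighbors[i])
    unique = {c for c, k in counts.items() if k == 1}
    shared = {c for c, k in counts.items() if k > 1}
    return unique, shared


def unique_lower_bound(d_p, size, beta):
    """Evaluate 2 * d_p * |P| * (beta - 1/2), the lower bound on the number
    of unique neighbors of a set P of the given size in an expander."""
    return 2 * d_p * size * (beta - Fraction(1, 2))


if __name__ == "__main__":
    # Toy graph: neighbors[i] is the set of constraints adjacent to node i.
    toy = [{0, 1}, {2, 3}, {4, 5}, {0, 2}]
    print(split_neighborhood(toy, [0, 3]))
    print(unique_lower_bound(d_p=2, size=2, beta=Fraction(3, 4)))
```

For the pair of nodes 0 and 3 in the toy graph, constraint 0 is shared while constraints 1 and 2 are unique.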

D. Choice of Parameters 

In order to put together the results of the previous two 
subsections and obtain a neural associative scheme that stores 
an exponential number of patterns and is capable of error 
correction, we need to carefully choose the various relevant 
parameters. We summarize some design principles below. 

• From Theorems 6 and 9, the choice of $\beta$ depends on $d_p$,
according to $\frac{1}{2} + \frac{1}{4d_p} < \beta < 1 - \frac{1}{d_p}$.

• Choose dc,Q,v,^ so that Theoreml4] yields an exponen- 
tial number of patterns. 

• For a fixed $\alpha$, $n$ has to be chosen large enough so that an
$(\alpha n, \beta d_p)$ expander exists according to Theorem 6 with
$\beta > 3/4$ and so that $\alpha n/2 > e_{\min} = 2$.
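The numerical side of these design principles can be sketched in a few lines of Python. The sketch assumes that the constraints on $\beta$ read $\frac{1}{2} + \frac{1}{4d_p} < \beta < 1 - \frac{1}{d_p}$ together with $\beta > 3/4$, and that the size condition is $\alpha n/2 > e_{\min}$ with $e_{\min} = 2$. The function names are hypothetical, and the second function covers only the arithmetic condition on $n$, not the asymptotic existence of an expander.

```python
import math
from fractions import Fraction


def beta_interval(d_p):
    """Open interval (lo, hi) of admissible beta for pattern degree d_p,
    or None if the interval is empty."""
    lo = max(Fraction(1, 2) + Fraction(1, 4 * d_p), Fraction(3, 4))
    hi = 1 - Fraction(1, d_p)
    return (lo, hi) if lo < hi else None


def min_network_size(alpha, e_min=2):
    """Smallest integer n satisfying alpha * n / 2 > e_min."""
    return math.floor(2 * e_min / alpha) + 1


if __name__ == "__main__":
    for d_p in (4, 5, 8, 16):
        print(d_p, beta_interval(d_p))
    print(min_network_size(Fraction(1, 10)))
```

Under these assumptions the admissible interval for $\beta$ is empty for $d_p \le 4$, which suggests that the pattern-node degree must be at least 5 for all of the conditions to hold simultaneously.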

Once we choose a judicious set of parameters according to
the above requirements, we have a neural associative memory
that is guaranteed to recall an exponential number of patterns
even if the input is corrupted by errors in two coordinates. Our
simulation results will reveal that a greater number of errors
can be corrected in practice.